diff --git a/design/BulkAPI.md b/design/BulkAPI.md
new file mode 100644
index 000000000..bd7327c89
--- /dev/null
+++ b/design/BulkAPI.md
@@ -0,0 +1,254 @@
+# Bulk API Documentation
+
+Bulk is an API designed to provide resource optimization recommendations in bulk for all available
+containers, namespaces, etc., for a cluster connected via the datasource integration framework. Bulk can
+be configured with filters to exclude or include namespaces, workloads, containers, or labels when generating
+recommendations, and it can generate recommendations at the container level, the namespace level, or both.
+
+Bulk returns a `job_id` in its response so that the job can be tracked. The user can use the `job_id` to monitor the
+progress of the job.
+
+## Task Flow When Bulk Is Invoked
+
+1. Returns a unique `job_id`.
+2. In the background, Bulk:
+    - First performs a handshake with the datasource.
+    - Using queries, fetches the list of namespaces, workloads, and containers of the connected datasource.
+    - Creates experiments, one for each container (alpha release).
+    - Triggers `generateRecommendations` for each container.
+    - Once all experiments are created and recommendations are generated, marks the `job_id` as "COMPLETED".
+
+## API Specification
+
+### POST /bulk
+
+**Request Payload (JSON):**
+
+```json
+{
+  "filter": {
+    "exclude": {
+      "namespace": [],
+      "workload": [],
+      "containers": [],
+      "labels": {}
+    },
+    "include": {
+      "namespace": [],
+      "workload": [],
+      "containers": [],
+      "labels": {
+        "key1": "value1",
+        "key2": "value2"
+      }
+    }
+  },
+  "time_range": {},
+  "datasource": "Cbank1Xyz",
+  "experiment_types": [
+    "container",
+    "namespace"
+  ]
+}
+```
+
+**filter:** This object contains both exclusion and inclusion filters to specify the scope of data being queried.
+
+- **exclude:** Defines the criteria to exclude certain data.
+    - **namespace:** A list of Kubernetes namespaces to exclude. If empty, no namespaces are excluded.
+    - **workload:** A list of workloads to exclude.
+    - **containers:** A list of container names to exclude.
+    - **labels:** Key-value pairs of labels to exclude.
+
+- **include:** Defines the criteria to include specific data.
+    - **namespace:** A list of Kubernetes namespaces to include.
+    - **workload:** A list of workloads to include.
+    - **containers:** A list of container names to include.
+    - **labels:** Key-value pairs of labels to include.
+
+- **time_range:** Specifies the time range for querying the data. If empty, no specific time range is applied.
+
+- **datasource:** The data source, e.g., `"Cbank1Xyz"`.
+
+- **experiment_types:** Specifies the type(s) of experiments to run, e.g., `"container"` or `"namespace"`.
+
+### Success Response
+
+- **Status:** 200 OK
+- **Body:**
+
+```json
+{
+  "job_id": "123e4567-e89b-12d3-a456-426614174000"
+}
+```
+
+### GET /bulk
+
+```bash
+GET /bulk?job_id=123e4567-e89b-12d3-a456-426614174000
+```
+
+**Body (JSON):**
+
+```json
+{
+  "status": "COMPLETED",
+  "total_experiments": 23,
+  "processed_experiments": 23,
+  "job_id": "54905959-77d4-42ba-8e06-90bb97b823b9",
+  "job_start_time": "2024-10-10T06:07:09.066Z",
+  "job_end_time": "2024-10-10T06:07:17.471Z"
+}
+```
+
+```bash
+GET /bulk?job_id=123e4567-e89b-12d3-a456-426614174000&verbose=true
+```
+
+When `verbose=true`, additional detailed information about the job is provided.
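+As a usage sketch, the whole flow can be driven from the shell. The host, port, and `payload.json` file below are
+placeholders, and `jq` is only used to pull fields out of the responses; a `FAILED` status would need extra
+handling in a real script:
+
+```bash
+# Submit the bulk request defined in payload.json and capture the returned job_id
+JOB_ID=$(curl -s -X POST -H 'Content-Type: application/json' \
+  -d @payload.json "http://<KRUIZE_HOST>:<KRUIZE_PORT>/bulk" | jq -r '.job_id')
+
+# Poll the job status every 5 seconds until it reports COMPLETED
+while [ "$(curl -s "http://<KRUIZE_HOST>:<KRUIZE_PORT>/bulk?job_id=${JOB_ID}" | jq -r '.status')" != "COMPLETED" ]; do
+  sleep 5
+done
+
+# Fetch the detailed (verbose) view of the finished job
+curl -s "http://<KRUIZE_HOST>:<KRUIZE_PORT>/bulk?job_id=${JOB_ID}&verbose=true"
+```
+
+**Body (JSON):**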
+ +```json +{ + "status": "IN_PROGRESS", + "total_experiments": 23, + "processed_experiments": 22, + "data": { + "experiments": { + "new": [ + "prometheus-1|default|monitoring|node-exporter(daemonset)|node-exporter", + "prometheus-1|default|cadvisor|cadvisor(daemonset)|cadvisor", + "prometheus-1|default|monitoring|alertmanager-main(statefulset)|config-reloader", + "prometheus-1|default|monitoring|alertmanager-main(statefulset)|alertmanager", + "prometheus-1|default|monitoring|prometheus-operator(deployment)|kube-rbac-proxy", + "prometheus-1|default|kube-system|coredns(deployment)|coredns", + "prometheus-1|default|monitoring|prometheus-k8s(statefulset)|config-reloader", + "prometheus-1|default|monitoring|blackbox-exporter(deployment)|kube-rbac-proxy", + "prometheus-1|default|monitoring|prometheus-operator(deployment)|prometheus-operator", + "prometheus-1|default|monitoring|node-exporter(daemonset)|kube-rbac-proxy", + "prometheus-1|default|monitoring|kube-state-metrics(deployment)|kube-rbac-proxy-self", + "prometheus-1|default|monitoring|kube-state-metrics(deployment)|kube-state-metrics", + "prometheus-1|default|monitoring|kruize(deployment)|kruize", + "prometheus-1|default|monitoring|blackbox-exporter(deployment)|module-configmap-reloader", + "prometheus-1|default|monitoring|prometheus-k8s(statefulset)|prometheus", + "prometheus-1|default|monitoring|kube-state-metrics(deployment)|kube-rbac-proxy-main", + "prometheus-1|default|kube-system|kube-proxy(daemonset)|kube-proxy", + "prometheus-1|default|monitoring|prometheus-adapter(deployment)|prometheus-adapter", + "prometheus-1|default|monitoring|grafana(deployment)|grafana", + "prometheus-1|default|kube-system|kindnet(daemonset)|kindnet-cni", + "prometheus-1|default|monitoring|kruize-db-deployment(deployment)|kruize-db", + "prometheus-1|default|monitoring|blackbox-exporter(deployment)|blackbox-exporter" + ], + "updated": [], + "failed": null + }, + "recommendations": { + "data": { + "processed": [ + "prometheus-1|default|monitoring|alertmanager-main(statefulset)|config-reloader", + "prometheus-1|default|monitoring|node-exporter(daemonset)|node-exporter", + "prometheus-1|default|local-path-storage|local-path-provisioner(deployment)|local-path-provisioner", + "prometheus-1|default|monitoring|alertmanager-main(statefulset)|alertmanager", + "prometheus-1|default|monitoring|prometheus-operator(deployment)|kube-rbac-proxy", + "prometheus-1|default|kube-system|coredns(deployment)|coredns", + "prometheus-1|default|monitoring|blackbox-exporter(deployment)|kube-rbac-proxy", + "prometheus-1|default|monitoring|prometheus-k8s(statefulset)|config-reloader", + "prometheus-1|default|monitoring|prometheus-operator(deployment)|prometheus-operator", + "prometheus-1|default|monitoring|node-exporter(daemonset)|kube-rbac-proxy", + "prometheus-1|default|monitoring|kube-state-metrics(deployment)|kube-rbac-proxy-self", + "prometheus-1|default|monitoring|kube-state-metrics(deployment)|kube-state-metrics", + "prometheus-1|default|monitoring|kruize(deployment)|kruize", + "prometheus-1|default|monitoring|blackbox-exporter(deployment)|module-configmap-reloader", + "prometheus-1|default|monitoring|prometheus-k8s(statefulset)|prometheus", + "prometheus-1|default|monitoring|kube-state-metrics(deployment)|kube-rbac-proxy-main", + "prometheus-1|default|kube-system|kube-proxy(daemonset)|kube-proxy", + "prometheus-1|default|monitoring|prometheus-adapter(deployment)|prometheus-adapter", + "prometheus-1|default|monitoring|grafana(deployment)|grafana", + 
"prometheus-1|default|kube-system|kindnet(daemonset)|kindnet-cni", + "prometheus-1|default|monitoring|kruize-db-deployment(deployment)|kruize-db", + "prometheus-1|default|monitoring|blackbox-exporter(deployment)|blackbox-exporter" + ], + "processing": [ + "prometheus-1|default|cadvisor|cadvisor(daemonset)|cadvisor" + ], + "unprocessed": [ + ], + "failed": [] + } + } + }, + "job_id": "5798a2df-6c67-467b-a3c2-befe634a0e3a", + "job_start_time": "2024-10-09T18:09:31.549Z", + "job_end_time": null +} +``` + +### Response Parameters + +## API Description: Experiment and Recommendation Processing Status + +This API response describes the status of a job that processes multiple experiments and generates recommendations for +resource optimization in Kubernetes environments. Below is a breakdown of the JSON response: + +### Fields: + +- **status**: + - **Type**: `String` + - **Description**: Current status of the job. Can be "IN_PROGRESS", "COMPLETED", "FAILED", etc. + +- **total_experiments**: + - **Type**: `Integer` + - **Description**: Total number of experiments to be processed in the job. + +- **processed_experiments**: + - **Type**: `Integer` + - **Description**: Number of experiments that have been processed so far. + +- **data**: + - **Type**: `Object` + - **Description**: Contains detailed information about the experiments and recommendations being processed. + + - **experiments**: + - **new**: + - **Type**: `Array of Strings` + - **Description**: List of new experiments that have been identified but not yet processed. + + - **updated**: + - **Type**: `Array of Strings` + - **Description**: List of experiments that were previously processed but have now been updated. + + - **failed**: + - **Type**: `null or Array` + - **Description**: List of experiments that failed during processing. If no failures, the value is `null`. + + - **recommendations**: + - **data**: + - **processed**: + - **Type**: `Array of Strings` + - **Description**: List of experiments for which recommendations have already been processed. + + - **processing**: + - **Type**: `Array of Strings` + - **Description**: List of experiments that are currently being processed for recommendations. + + - **unprocessed**: + - **Type**: `Array of Strings` + - **Description**: List of experiments that have not yet been processed for recommendations. + + - **failed**: + - **Type**: `Array of Strings` + - **Description**: List of experiments for which the recommendation process failed. + +- **job_id**: + - **Type**: `String` + - **Description**: Unique identifier for the job. + +- **job_start_time**: + - **Type**: `String (ISO 8601 format)` + - **Description**: Start timestamp of the job. + +- **job_end_time**: + - **Type**: `String (ISO 8601 format) or null` + - **Description**: End timestamp of the job. If the job is still in progress, this will be `null`. + diff --git a/design/MonitoringModeAPI.md b/design/MonitoringModeAPI.md index 75899125d..91a3d1364 100644 --- a/design/MonitoringModeAPI.md +++ b/design/MonitoringModeAPI.md @@ -2960,6 +2960,506 @@ Returns the recommendation at a particular timestamp if it exists + +**Response for GPU workloads** + +`GET /listRecommendations` + +`curl -H 'Accept: application/json' http://:/listRecommendations?experiment_name=job-01` + +
+Example Response with GPU Recommendations + +```json +[ + { + "cluster_name": "default", + "experiment_type": "container", + "kubernetes_objects": [ + { + "type": "statefulset", + "name": "human-eval-benchmark", + "namespace": "unpartitioned", + "containers": [ + { + "container_name": "human-eval-benchmark", + "recommendations": { + "version": "1.0", + "notifications": { + "111000": { + "type": "info", + "message": "Recommendations Are Available", + "code": 111000 + } + }, + "data": { + "2024-10-04T09:16:40.000Z": { + "notifications": { + "111101": { + "type": "info", + "message": "Short Term Recommendations Available", + "code": 111101 + }, + "111102": { + "type": "info", + "message": "Medium Term Recommendations Available", + "code": 111102 + } + }, + "monitoring_end_time": "2024-10-04T09:16:40.000Z", + "current": { + "limits": { + "cpu": { + "amount": 2.0, + "format": "cores" + }, + "memory": { + "amount": 8.589934592E9, + "format": "bytes" + } + }, + "requests": { + "cpu": { + "amount": 1.0, + "format": "cores" + }, + "memory": { + "amount": 8.589934592E9, + "format": "bytes" + } + } + }, + "recommendation_terms": { + "short_term": { + "duration_in_hours": 24.0, + "notifications": { + "112101": { + "type": "info", + "message": "Cost Recommendations Available", + "code": 112101 + }, + "112102": { + "type": "info", + "message": "Performance Recommendations Available", + "code": 112102 + } + }, + "monitoring_start_time": "2024-10-03T09:16:40.000Z", + "recommendation_engines": { + "cost": { + "pods_count": 1, + "confidence_level": 0.0, + "config": { + "limits": { + "cpu": { + "amount": 1.004649523106615, + "format": "cores" + }, + "nvidia.com/mig-3g.20gb": { + "amount": 1.0, + "format": "cores" + }, + "memory": { + "amount": 4.9960943616E9, + "format": "bytes" + } + }, + "requests": { + "cpu": { + "amount": 1.004649523106615, + "format": "cores" + }, + "memory": { + "amount": 4.9960943616E9, + "format": "bytes" + } + } + }, + "variation": { + "limits": { + "cpu": { + "amount": -0.995350476893385, + "format": "cores" + }, + "memory": { + "amount": -3.5938402303999996E9, + "format": "bytes" + } + }, + "requests": { + "cpu": { + "amount": 0.004649523106615039, + "format": "cores" + }, + "memory": { + "amount": -3.5938402303999996E9, + "format": "bytes" + } + } + }, + "notifications": {} + }, + "performance": { + "pods_count": 1, + "confidence_level": 0.0, + "config": { + "limits": { + "cpu": { + "amount": 1.36656145696268, + "format": "cores" + }, + "memory": { + "amount": 4.9960943616E9, + "format": "bytes" + }, + "nvidia.com/mig-4g.20gb": { + "amount": 1.0, + "format": "cores" + } + }, + "requests": { + "cpu": { + "amount": 1.36656145696268, + "format": "cores" + }, + "memory": { + "amount": 4.9960943616E9, + "format": "bytes" + } + } + }, + "variation": { + "limits": { + "cpu": { + "amount": -0.63343854303732, + "format": "cores" + }, + "memory": { + "amount": -3.5938402303999996E9, + "format": "bytes" + } + }, + "requests": { + "cpu": { + "amount": 0.36656145696268005, + "format": "cores" + }, + "memory": { + "amount": -3.5938402303999996E9, + "format": "bytes" + } + } + }, + "notifications": {} + } + }, + "plots": { + "datapoints": 4, + "plots_data": { + "2024-10-04T09:16:40.000Z": { + "cpuUsage": { + "min": 0.005422723351267242, + "q1": 1.003281151419465, + "median": 1.0118160468783521, + "q3": 1.012961901380266, + "max": 1.36656145696268, + "format": "cores" + }, + "memoryUsage": { + "min": 3.68019456E9, + "q1": 3.681001472E9, + "median": 4.058411008E9, + "q3": 4.093308928E9, + "max": 
4.094062592E9, + "format": "bytes" + } + }, + "2024-10-04T03:16:40.000Z": { + "cpuUsage": { + "min": 0.998888009348188, + "q1": 1.0029943714818779, + "median": 1.0033621837551019, + "q3": 1.0040859908301978, + "max": 1.0828338199135354, + "format": "cores" + }, + "memoryUsage": { + "min": 3.679281152E9, + "q1": 3.680755712E9, + "median": 3.680989184E9, + "q3": 3.687673856E9, + "max": 4.163411968E9, + "format": "bytes" + } + }, + "2024-10-03T15:16:40.000Z": { + "cpuUsage": { + "min": 0.005425605536480822, + "q1": 0.006038658069363403, + "median": 0.006183237135144752, + "q3": 0.006269460195927269, + "max": 0.006916437328481231, + "format": "cores" + }, + "memoryUsage": { + "min": 2.192125952E9, + "q1": 2.192388096E9, + "median": 2.192388096E9, + "q3": 2.192388096E9, + "max": 2.19265024E9, + "format": "bytes" + } + }, + "2024-10-03T21:16:40.000Z": { + "cpuUsage": { + "min": 0.0052184839046300075, + "q1": 0.006229799261227028, + "median": 1.0110868114913476, + "q3": 1.0124661560983785, + "max": 2.3978065580305032, + "format": "cores" + }, + "memoryUsage": { + "min": 2.118012928E9, + "q1": 2.192392192E9, + "median": 4.161662976E9, + "q3": 4.162850816E9, + "max": 4.163170304E9, + "format": "bytes" + } + } + } + } + }, + "medium_term": { + "duration_in_hours": 168.0, + "notifications": { + "112101": { + "type": "info", + "message": "Cost Recommendations Available", + "code": 112101 + }, + "112102": { + "type": "info", + "message": "Performance Recommendations Available", + "code": 112102 + } + }, + "monitoring_start_time": "2024-09-27T09:16:40.000Z", + "recommendation_engines": { + "cost": { + "pods_count": 1, + "confidence_level": 0.0, + "config": { + "limits": { + "cpu": { + "amount": 0.015580688959425347, + "format": "cores" + }, + "nvidia.com/mig-3g.20gb": { + "amount": 1.0, + "format": "cores" + }, + "memory": { + "amount": 4.9960943616E9, + "format": "bytes" + } + }, + "requests": { + "cpu": { + "amount": 0.015580688959425347, + "format": "cores" + }, + "memory": { + "amount": 4.9960943616E9, + "format": "bytes" + } + } + }, + "variation": { + "limits": { + "cpu": { + "amount": -1.9844193110405746, + "format": "cores" + }, + "memory": { + "amount": -3.5938402303999996E9, + "format": "bytes" + } + }, + "requests": { + "cpu": { + "amount": -0.9844193110405747, + "format": "cores" + }, + "memory": { + "amount": -3.5938402303999996E9, + "format": "bytes" + } + } + }, + "notifications": {} + }, + "performance": { + "pods_count": 1, + "confidence_level": 0.0, + "config": { + "limits": { + "cpu": { + "amount": 1.025365696933566, + "format": "cores" + }, + "memory": { + "amount": 4.9960943616E9, + "format": "bytes" + }, + "nvidia.com/mig-4g.20gb": { + "amount": 1.0, + "format": "cores" + } + }, + "requests": { + "cpu": { + "amount": 1.025365696933566, + "format": "cores" + }, + "memory": { + "amount": 4.9960943616E9, + "format": "bytes" + } + } + }, + "variation": { + "limits": { + "cpu": { + "amount": -0.974634303066434, + "format": "cores" + }, + "memory": { + "amount": -3.5938402303999996E9, + "format": "bytes" + } + }, + "requests": { + "cpu": { + "amount": 0.02536569693356605, + "format": "cores" + }, + "memory": { + "amount": -3.5938402303999996E9, + "format": "bytes" + } + } + }, + "notifications": {} + } + }, + "plots": { + "datapoints": 7, + "plots_data": { + "2024-09-29T09:16:40.000Z": {}, + "2024-10-04T09:16:40.000Z": { + "cpuUsage": { + "min": 0.0052184839046300075, + "q1": 0.006207971650471658, + "median": 1.0032201196711934, + "q3": 1.0115567178617741, + "max": 2.3978065580305032, + 
"format": "cores" + }, + "memoryUsage": { + "min": 2.118012928E9, + "q1": 2.192392192E9, + "median": 3.6808704E9, + "q3": 4.093349888E9, + "max": 4.163411968E9, + "format": "bytes" + } + }, + "2024-09-30T09:16:40.000Z": {}, + "2024-10-02T09:16:40.000Z": { + "cpuUsage": { + "min": 0.00554280490421283, + "q1": 0.015358846193868379, + "median": 0.015705212168337323, + "q3": 1.010702281083678, + "max": 1.0139464901392594, + "format": "cores" + }, + "memoryUsage": { + "min": 2.192125952E9, + "q1": 2.717663232E9, + "median": 2.719612928E9, + "q3": 2.719617024E9, + "max": 2.720600064E9, + "format": "bytes" + } + }, + "2024-09-28T09:16:40.000Z": {}, + "2024-10-03T09:16:40.000Z": { + "cpuUsage": { + "min": 0.005373319820852367, + "q1": 0.006054991034195089, + "median": 0.006142447129874265, + "q3": 0.006268777122325054, + "max": 0.007366566784856696, + "format": "cores" + }, + "memoryUsage": { + "min": 2.192125952E9, + "q1": 2.192388096E9, + "median": 2.192388096E9, + "q3": 2.192388096E9, + "max": 2.192654336E9, + "format": "bytes" + } + }, + "2024-10-01T09:16:40.000Z": { + "cpuUsage": { + "min": 0.003319077875529473, + "q1": 1.0101034685479167, + "median": 1.0118171810142638, + "q3": 1.0208974318073034, + "max": 3.5577616386258963, + "format": "cores" + }, + "memoryUsage": { + "min": 1.77057792E8, + "q1": 2.64523776E9, + "median": 2.651078656E9, + "q3": 2.693431296E9, + "max": 2.705133568E9, + "format": "bytes" + } + } + } + } + }, + "long_term": { + "duration_in_hours": 360.0, + "notifications": { + "120001": { + "type": "info", + "message": "There is not enough data available to generate a recommendation.", + "code": 120001 + } + } + } + } + } + } + } + } + ] + } + ], + "version": "v2.0", + "experiment_name": "human_eval_exp" + } +] +``` +
+ ### Invalid Scenarios:
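+
+For example, requesting recommendations for an experiment that does not exist returns an error payload rather
+than a recommendation list. The response shape below is a sketch; the exact message text may differ:
+
+```bash
+curl -H 'Accept: application/json' \
+  "http://<URL>:<PORT>/listRecommendations?experiment_name=does_not_exist"
+# Expected: HTTP 400 with a body similar to
+# {"message": "Given experiment name - does_not_exist is not valid", "httpcode": 400, "documentationLink": "", "status": "ERROR"}
+```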
@@ -5049,6 +5549,11 @@ structured and easily interpretable way for users or external systems to access
+ + + + + --- diff --git a/manifests/autotune/performance-profiles/resource_optimization_local_monitoring.json b/manifests/autotune/performance-profiles/resource_optimization_local_monitoring.json index add7fd4ca..d2e243127 100644 --- a/manifests/autotune/performance-profiles/resource_optimization_local_monitoring.json +++ b/manifests/autotune/performance-profiles/resource_optimization_local_monitoring.json @@ -412,6 +412,46 @@ "query": "max(last_over_time(timestamp((sum by (namespace) (container_cpu_usage_seconds_total{namespace=\"$NAMESPACE$\"})) > 0 )[15d:]))" } ] + }, + { + "name": "gpuCoreUsage", + "datasource": "prometheus", + "value_type": "double", + "kubernetes_object": "container", + "aggregation_functions": [ + { + "function": "avg", + "query": "avg by (Hostname,device,modelName,UUID,exported_container,exported_namespace) (avg_over_time(DCGM_FI_DEV_GPU_UTIL{exported_namespace=\"$NAMESPACE$\",exported_container=\"$CONTAINER_NAME$\"}[$MEASUREMENT_DURATION_IN_MIN$m]))" + }, + { + "function": "max", + "query": "max by (Hostname,device,modelName,UUID,exported_container,exported_namespace) (max_over_time(DCGM_FI_DEV_GPU_UTIL{exported_namespace=\"$NAMESPACE$\",exported_container=\"$CONTAINER_NAME$\"}[$MEASUREMENT_DURATION_IN_MIN$m]))" + }, + { + "function": "min", + "query": "min by (Hostname,device,modelName,UUID,exported_container,exported_namespace) (min_over_time(DCGM_FI_DEV_GPU_UTIL{exported_namespace=\"$NAMESPACE$\",exported_container=\"$CONTAINER_NAME$\"}[$MEASUREMENT_DURATION_IN_MIN$m]))" + } + ] + }, + { + "name": "gpuMemoryUsage", + "datasource": "prometheus", + "value_type": "double", + "kubernetes_object": "container", + "aggregation_functions": [ + { + "function": "avg", + "query": "avg by (Hostname,device,modelName,UUID,exported_container,exported_namespace) (avg_over_time(DCGM_FI_DEV_MEM_COPY_UTIL{exported_namespace=\"$NAMESPACE$\",exported_container=\"$CONTAINER_NAME$\"}[$MEASUREMENT_DURATION_IN_MIN$m]))" + }, + { + "function": "max", + "query": "max by (Hostname,device,modelName,UUID,exported_container,exported_namespace) (max_over_time(DCGM_FI_DEV_MEM_COPY_UTIL{exported_namespace=\"$NAMESPACE$\",exported_container=\"$CONTAINER_NAME$\"}[$MEASUREMENT_DURATION_IN_MIN$m]))" + }, + { + "function": "min", + "query": "min by (Hostname,device,modelName,UUID,exported_container,exported_namespace) (min_over_time(DCGM_FI_DEV_MEM_COPY_UTIL{exported_namespace=\"$NAMESPACE$\",exported_container=\"$CONTAINER_NAME$\"}[$MEASUREMENT_DURATION_IN_MIN$m]))" + } + ] } ] } diff --git a/manifests/autotune/performance-profiles/resource_optimization_local_monitoring.yaml b/manifests/autotune/performance-profiles/resource_optimization_local_monitoring.yaml index e638c07e9..92a68a6b2 100644 --- a/manifests/autotune/performance-profiles/resource_optimization_local_monitoring.yaml +++ b/manifests/autotune/performance-profiles/resource_optimization_local_monitoring.yaml @@ -247,167 +247,207 @@ slo: - function: max query: 'max by(namespace,container) (last_over_time((timestamp(container_cpu_usage_seconds_total{namespace="$NAMESPACE$", container="$CONTAINER_NAME$"} > 0))[15d:]))' - ## namespace related queries - - # Namespace quota for CPU requests - # Show namespace quota for CPU requests in cores for a namespace - - name: namespaceCpuRequest - datasource: prometheus - value_type: "double" - kubernetes_object: "namespace" - aggregation_functions: - # sum of all cpu request quotas for a namespace in cores - - function: sum - query: 'sum by (namespace) (kube_resourcequota{namespace="$NAMESPACE$", 
resource="requests.cpu", type="hard"})' - - # Namespace quota for CPU limits - # Show namespace quota for CPU limits in cores for a namespace - - name: namespaceCpuLimit - datasource: prometheus - value_type: "double" - kubernetes_object: "namespace" - aggregation_functions: - # sum of all cpu limits quotas for a namespace in cores - - function: sum - query: 'sum by (namespace) (kube_resourcequota{namespace="$NAMESPACE$", resource="limits.cpu", type="hard"})' - - - # Namespace quota for memory requests - # Show namespace quota for memory requests in bytes for a namespace - - name: namespaceMemoryRequest - datasource: prometheus - value_type: "double" - kubernetes_object: "namespace" - aggregation_functions: - # sum of all memory requests quotas for a namespace in bytes - - function: sum - query: 'sum by (namespace) (kube_resourcequota{namespace="$NAMESPACE$", resource="requests.memory", type="hard"})' - - - # Namespace quota for memory limits - # Show namespace quota for memory limits in bytes for a namespace - - name: namespaceMemoryLimit - datasource: prometheus - value_type: "double" - kubernetes_object: "namespace" - aggregation_functions: - # sum of all memory limits quotas for a namespace in bytes - - function: sum - query: 'sum by (namespace) (kube_resourcequota{namespace="$NAMESPACE$", resource="limits.memory", type="hard"})' - - - # Namespace CPU usage - # Show cpu usages in cores for a namespace - - name: namespaceCpuUsage - datasource: prometheus - value_type: "double" - kubernetes_object: "namespace" - aggregation_functions: - # average cpu usages in cores for a namespace - - function: avg - query: 'avg_over_time(sum by(namespace) (node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate{namespace="$NAMESPACE$", container!="", container!="POD", pod!=""})[$MEASUREMENT_DURATION_IN_MIN$m:])' + ## namespace related queries - # maximum cpu usages in cores for a namespace - - function: max - query: 'max_over_time(sum by(namespace) (node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate{namespace="$NAMESPACE$", container!="", container!="POD", pod!=""})[$MEASUREMENT_DURATION_IN_MIN$m:])' + # Namespace quota for CPU requests + # Show namespace quota for CPU requests in cores for a namespace + - name: namespaceCpuRequest + datasource: prometheus + value_type: "double" + kubernetes_object: "namespace" + aggregation_functions: + # sum of all cpu request quotas for a namespace in cores + - function: sum + query: 'sum by (namespace) (kube_resourcequota{namespace="$NAMESPACE$", resource="requests.cpu", type="hard"})' - # minimum cpu usages in cores for a namespace - - function: min - query: 'min_over_time(sum by(namespace) (node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate{namespace="$NAMESPACE$", container!="", container!="POD", pod!=""})[$MEASUREMENT_DURATION_IN_MIN$m:])' + # Namespace quota for CPU limits + # Show namespace quota for CPU limits in cores for a namespace + - name: namespaceCpuLimit + datasource: prometheus + value_type: "double" + kubernetes_object: "namespace" + aggregation_functions: + # sum of all cpu limits quotas for a namespace in cores + - function: sum + query: 'sum by (namespace) (kube_resourcequota{namespace="$NAMESPACE$", resource="limits.cpu", type="hard"})' - # Namespace CPU Throttle - # Show cpu throttle in cores for a namespace - - name: namespaceCpuThrottle - datasource: prometheus - value_type: "double" - kubernetes_object: "namespace" - aggregation_functions: - # average cpu throttle in cores for a 
namespace - - function: avg - query: 'avg_over_time(sum by(namespace) (rate(container_cpu_cfs_throttled_seconds_total{namespace="$NAMESPACE$", container!="", container!="POD", pod!=""}[5m]))[$MEASUREMENT_DURATION_IN_MIN$m:])' + # Namespace quota for memory requests + # Show namespace quota for memory requests in bytes for a namespace + - name: namespaceMemoryRequest + datasource: prometheus + value_type: "double" + kubernetes_object: "namespace" + aggregation_functions: + # sum of all memory requests quotas for a namespace in bytes + - function: sum + query: 'sum by (namespace) (kube_resourcequota{namespace="$NAMESPACE$", resource="requests.memory", type="hard"})' - # maximum cpu throttle in cores for a namespace - - function: max - query: 'max_over_time(sum by(namespace) (rate(container_cpu_cfs_throttled_seconds_total{namespace="$NAMESPACE$", container!="", container!="POD", pod!=""}[5m]))[$MEASUREMENT_DURATION_IN_MIN$m:])' - # minimum cpu throttle in cores for a namespace - - function: min - query: 'min_over_time(sum by(namespace) (rate(container_cpu_cfs_throttled_seconds_total{namespace="$NAMESPACE$", container!="", container!="POD", pod!=""}[5m]))[$MEASUREMENT_DURATION_IN_MIN$m:])' + # Namespace quota for memory limits + # Show namespace quota for memory limits in bytes for a namespace + - name: namespaceMemoryLimit + datasource: prometheus + value_type: "double" + kubernetes_object: "namespace" + aggregation_functions: + # sum of all memory limits quotas for a namespace in bytes + - function: sum + query: 'sum by (namespace) (kube_resourcequota{namespace="$NAMESPACE$", resource="limits.memory", type="hard"})' - # Namespace memory usage - # Show memory usages in bytes for a namespace - - name: namespaceMemoryUsage - datasource: prometheus - value_type: "double" - kubernetes_object: "namespace" - aggregation_functions: - # average memory usage in bytes for a namespace - - function: avg - query: 'avg_over_time(sum by(namespace) (container_memory_working_set_bytes{namespace="$NAMESPACE$", container!="", container!="POD", pod!=""})[$MEASUREMENT_DURATION_IN_MIN$m:])' + # Namespace CPU usage + # Show cpu usages in cores for a namespace + - name: namespaceCpuUsage + datasource: prometheus + value_type: "double" + kubernetes_object: "namespace" + aggregation_functions: + # average cpu usages in cores for a namespace + - function: avg + query: 'avg_over_time(sum by(namespace) (node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate{namespace="$NAMESPACE$", container!="", container!="POD", pod!=""})[$MEASUREMENT_DURATION_IN_MIN$m:])' - # maximum memory usage in bytes for a namespace - - function: max - query: 'max_over_time(sum by(namespace) (container_memory_working_set_bytes{namespace="$NAMESPACE$", container!="", container!="POD", pod!=""})[$MEASUREMENT_DURATION_IN_MIN$m:])' + # maximum cpu usages in cores for a namespace + - function: max + query: 'max_over_time(sum by(namespace) (node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate{namespace="$NAMESPACE$", container!="", container!="POD", pod!=""})[$MEASUREMENT_DURATION_IN_MIN$m:])' + + # minimum cpu usages in cores for a namespace + - function: min + query: 'min_over_time(sum by(namespace) (node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate{namespace="$NAMESPACE$", container!="", container!="POD", pod!=""})[$MEASUREMENT_DURATION_IN_MIN$m:])' + + + # Namespace CPU Throttle + # Show cpu throttle in cores for a namespace + - name: namespaceCpuThrottle + datasource: prometheus + 
value_type: "double" + kubernetes_object: "namespace" + aggregation_functions: + # average cpu throttle in cores for a namespace + - function: avg + query: 'avg_over_time(sum by(namespace) (rate(container_cpu_cfs_throttled_seconds_total{namespace="$NAMESPACE$", container!="", container!="POD", pod!=""}[5m]))[$MEASUREMENT_DURATION_IN_MIN$m:])' + + # maximum cpu throttle in cores for a namespace + - function: max + query: 'max_over_time(sum by(namespace) (rate(container_cpu_cfs_throttled_seconds_total{namespace="$NAMESPACE$", container!="", container!="POD", pod!=""}[5m]))[$MEASUREMENT_DURATION_IN_MIN$m:])' + + # minimum cpu throttle in cores for a namespace + - function: min + query: 'min_over_time(sum by(namespace) (rate(container_cpu_cfs_throttled_seconds_total{namespace="$NAMESPACE$", container!="", container!="POD", pod!=""}[5m]))[$MEASUREMENT_DURATION_IN_MIN$m:])' - # minimum memory usage in bytes for a namespace - - function: min - query: 'min_over_time(sum by(namespace) (container_memory_working_set_bytes{namespace="$NAMESPACE$", container!="", container!="POD", pod!=""})[$MEASUREMENT_DURATION_IN_MIN$m:])' + # Namespace memory usage + # Show memory usages in bytes for a namespace + - name: namespaceMemoryUsage + datasource: prometheus + value_type: "double" + kubernetes_object: "namespace" + aggregation_functions: + # average memory usage in bytes for a namespace + - function: avg + query: 'avg_over_time(sum by(namespace) (container_memory_working_set_bytes{namespace="$NAMESPACE$", container!="", container!="POD", pod!=""})[$MEASUREMENT_DURATION_IN_MIN$m:])' + + # maximum memory usage in bytes for a namespace + - function: max + query: 'max_over_time(sum by(namespace) (container_memory_working_set_bytes{namespace="$NAMESPACE$", container!="", container!="POD", pod!=""})[$MEASUREMENT_DURATION_IN_MIN$m:])' - # Namespace memory rss value - # Show memory rss in bytes for a namespace - - name: namespaceMemoryRSS - datasource: prometheus - value_type: "double" - kubernetes_object: "namespace" - aggregation_functions: - # average memory rss in bytes for a namespace + # minimum memory usage in bytes for a namespace + - function: min + query: 'min_over_time(sum by(namespace) (container_memory_working_set_bytes{namespace="$NAMESPACE$", container!="", container!="POD", pod!=""})[$MEASUREMENT_DURATION_IN_MIN$m:])' + + + # Namespace memory rss value + # Show memory rss in bytes for a namespace + - name: namespaceMemoryRSS + datasource: prometheus + value_type: "double" + kubernetes_object: "namespace" + aggregation_functions: + # average memory rss in bytes for a namespace + - function: avg + query: 'avg_over_time(sum by(namespace) (container_memory_rss{namespace="$NAMESPACE$", container!="", container!="POD", pod!=""})[$MEASUREMENT_DURATION_IN_MIN$m:])' + + # maximum memory rss in bytes for a namespace + - function: max + query: 'max_over_time(sum by(namespace) (container_memory_rss{namespace="$NAMESPACE$", container!="", container!="POD", pod!=""})[$MEASUREMENT_DURATION_IN_MIN$m:])' + + # minimum memory rss in bytes for a namespace + - function: min + query: 'min_over_time(sum by(namespace) (container_memory_rss{namespace="$NAMESPACE$", container!="", container!="POD", pod!=""})[$MEASUREMENT_DURATION_IN_MIN$m:])' + + + # Show total pods in a namespace + - name: namespaceTotalPods + datasource: prometheus + value_type: "double" + kubernetes_object: "namespace" + aggregation_functions: + # maximum total pods in a namespace + - function: max + query: 'max_over_time(sum by(namespace) 
((kube_pod_info{namespace="$NAMESPACE$"}))[$MEASUREMENT_DURATION_IN_MIN$m:])'
+        # average total pods in a namespace
+        - function: avg
+          query: 'avg_over_time(sum by(namespace) ((kube_pod_info{namespace="$NAMESPACE$"}))[$MEASUREMENT_DURATION_IN_MIN$m:])'
+
+
+      # Show total running pods in a namespace
+      - name: namespaceRunningPods
+        datasource: prometheus
+        value_type: "double"
+        kubernetes_object: "namespace"
+        aggregation_functions:
+          # maximum total pods in a namespace
+          - function: max
+            query: 'max_over_time(sum by(namespace) ((kube_pod_status_phase{phase="Running"}))[$MEASUREMENT_DURATION_IN_MIN$m:])'
+          # average total pods in a namespace
+          - function: avg
+            query: 'avg_over_time(sum by(namespace) ((kube_pod_status_phase{phase="Running"}))[$MEASUREMENT_DURATION_IN_MIN$m:])'
+
+      # Show last activity for a namespace
+      - name: namespaceMaxDate
+        datasource: prometheus
+        value_type: "double"
+        kubernetes_object: "namespace"
+        aggregation_functions:
+          - function: max
+            query: 'max(last_over_time(timestamp((sum by (namespace) (container_cpu_usage_seconds_total{namespace="$NAMESPACE$"})) > 0 )[15d:]))'
+
+      # GPU Related metrics
+
+      # GPU Core Usage
+      - name: gpuCoreUsage
+        datasource: prometheus
+        value_type: "double"
+        kubernetes_object: "container"
+
+        aggregation_functions:
+          # Average GPU Core Usage Percentage per container in a deployment
          - function: avg
-          query: 'avg_over_time(sum by(namespace) (container_memory_rss{namespace="$NAMESPACE$", container!="", container!="POD", pod!=""})[$MEASUREMENT_DURATION_IN_MIN$m:])'
+          query: 'avg by (Hostname,device,modelName,UUID,exported_container,exported_namespace) (avg_over_time(DCGM_FI_DEV_GPU_UTIL{exported_namespace="$NAMESPACE$",exported_container="$CONTAINER_NAME$"}[$MEASUREMENT_DURATION_IN_MIN$m]))'
-        # maximum memory rss in bytes for a namespace
+          # Maximum GPU Core Usage Percentage per container in a deployment
          - function: max
-          query: 'max_over_time(sum by(namespace) (container_memory_rss{namespace="$NAMESPACE$", container!="", container!="POD", pod!=""})[$MEASUREMENT_DURATION_IN_MIN$m:])'
+          query: 'max by (Hostname,device,modelName,UUID,exported_container,exported_namespace) (max_over_time(DCGM_FI_DEV_GPU_UTIL{exported_namespace="$NAMESPACE$",exported_container="$CONTAINER_NAME$"}[$MEASUREMENT_DURATION_IN_MIN$m]))'
-        # minimum memory rss in bytes for a namespace
+          # Minimum of GPU Core Usage Percentage for a container in a deployment
          - function: min
-          query: 'min_over_time(sum by(namespace) (container_memory_rss{namespace="$NAMESPACE$", container!="", container!="POD", pod!=""})[$MEASUREMENT_DURATION_IN_MIN$m:])'
+          query: 'min by (Hostname,device,modelName,UUID,exported_container,exported_namespace) (min_over_time(DCGM_FI_DEV_GPU_UTIL{exported_namespace="$NAMESPACE$",exported_container="$CONTAINER_NAME$"}[$MEASUREMENT_DURATION_IN_MIN$m]))'
+      # GPU Memory usage
+      - name: gpuMemoryUsage
+        datasource: prometheus
+        value_type: "double"
+        kubernetes_object: "container"
-      # Show total pods in a namespace
-      - name: namespaceTotalPods
-        datasource: prometheus
-        value_type: "double"
-        kubernetes_object: "namespace"
-        aggregation_functions:
-          # maximum total pods in a namespace
-          - function: max
-            query: 'max_over_time(sum by(namespace) ((kube_pod_info{namespace="$NAMESPACE$"}))[$MEASUREMENT_DURATION_IN_MIN$m:])'
-          # average total pods in a namespace
+        aggregation_functions:
+          # Average GPU Memory Usage Percentage per container in a deployment
          - function: avg
-          query: 'avg_over_time(sum by(namespace) ((kube_pod_info{namespace="$NAMESPACE$"}))[$MEASUREMENT_DURATION_IN_MIN$m:])'
+          query: 'avg by (Hostname,device,modelName,UUID,exported_container,exported_namespace) (avg_over_time(DCGM_FI_DEV_MEM_COPY_UTIL{exported_namespace="$NAMESPACE$",exported_container="$CONTAINER_NAME$"}[$MEASUREMENT_DURATION_IN_MIN$m]))'
-
-      # Show total running pods in a namespace
-      - name: namespaceRunningPods
-        datasource: prometheus
-        value_type: "double"
-        kubernetes_object: "namespace"
-        aggregation_functions:
-          # maximum total pods in a namespace
-          - function: max
-            query: 'max_over_time(sum by(namespace) ((kube_pod_status_phase{phase="Running"}))[$MEASUREMENT_DURATION_IN_MIN$m:])'
-          # average total pods in a namespace
-          - function: avg
-            query: 'avg_over_time(sum by(namespace) ((kube_pod_status_phase{phase="Running"}))[$MEASUREMENT_DURATION_IN_MIN$m:])'
-
-      # Show last activity for a namespace
-      - name: namespaceMaxDate
-        datasource: prometheus
-        value_type: "double"
-        kubernetes_object: "namespace"
-        aggregation_functions:
+          # Maximum GPU Memory Usage Percentage per container in a deployment
          - function: max
-          query: 'max(last_over_time(timestamp((sum by (namespace) (container_cpu_usage_seconds_total{namespace="$NAMESPACE$"})) > 0 )[15d:]))'
+          query: 'max by (Hostname,device,modelName,UUID,exported_container,exported_namespace) (max_over_time(DCGM_FI_DEV_MEM_COPY_UTIL{exported_namespace="$NAMESPACE$",exported_container="$CONTAINER_NAME$"}[$MEASUREMENT_DURATION_IN_MIN$m]))'
+
+          # Minimum of GPU Memory Usage Percentage for a container in a deployment
+          - function: min
+            query: 'min by (Hostname,device,modelName,UUID,exported_container,exported_namespace) (min_over_time(DCGM_FI_DEV_MEM_COPY_UTIL{exported_namespace="$NAMESPACE$",exported_container="$CONTAINER_NAME$"}[$MEASUREMENT_DURATION_IN_MIN$m]))'
\ No newline at end of file
diff --git a/manifests/autotune/performance-profiles/resource_optimization_local_monitoring_norecordingrules.json b/manifests/autotune/performance-profiles/resource_optimization_local_monitoring_norecordingrules.json
index eeef1a07e..4f4d261ae 100644
--- a/manifests/autotune/performance-profiles/resource_optimization_local_monitoring_norecordingrules.json
+++ b/manifests/autotune/performance-profiles/resource_optimization_local_monitoring_norecordingrules.json
@@ -389,6 +389,46 @@
         "query": "max(last_over_time(timestamp((sum by (namespace) (container_cpu_usage_seconds_total{namespace=\"$NAMESPACE$\"})) > 0 )[15d:]))"
       }
     ]
+  },
+  {
+    "name": "gpuCoreUsage",
+    "datasource": "prometheus",
+    "value_type": "double",
+    "kubernetes_object": "container",
+    "aggregation_functions": [
+      {
+        "function": "avg",
+        "query": "avg by (Hostname,device,modelName,UUID,exported_container,exported_namespace) (avg_over_time(DCGM_FI_DEV_GPU_UTIL{exported_namespace=\"$NAMESPACE$\",exported_container=\"$CONTAINER_NAME$\"}[$MEASUREMENT_DURATION_IN_MIN$m]))"
+      },
+      {
+        "function": "max",
+        "query": "max by (Hostname,device,modelName,UUID,exported_container,exported_namespace) (max_over_time(DCGM_FI_DEV_GPU_UTIL{exported_namespace=\"$NAMESPACE$\",exported_container=\"$CONTAINER_NAME$\"}[$MEASUREMENT_DURATION_IN_MIN$m]))"
+      },
+      {
+        "function": "min",
+        "query": "min by (Hostname,device,modelName,UUID,exported_container,exported_namespace) (min_over_time(DCGM_FI_DEV_GPU_UTIL{exported_namespace=\"$NAMESPACE$\",exported_container=\"$CONTAINER_NAME$\"}[$MEASUREMENT_DURATION_IN_MIN$m]))"
+      }
+    ]
+  },
+  {
+    "name": "gpuMemoryUsage",
+    "datasource": "prometheus",
+    "value_type": "double",
+    "kubernetes_object": "container",
+    
"aggregation_functions": [ + { + "function": "avg", + "query": "avg by (Hostname,device,modelName,UUID,exported_container,exported_namespace) (avg_over_time(DCGM_FI_DEV_MEM_COPY_UTIL{exported_namespace=\"$NAMESPACE$\",exported_container=\"$CONTAINER_NAME$\"}[$MEASUREMENT_DURATION_IN_MIN$m]))" + }, + { + "function": "max", + "query": "max by (Hostname,device,modelName,UUID,exported_container,exported_namespace) (max_over_time(DCGM_FI_DEV_MEM_COPY_UTIL{exported_namespace=\"$NAMESPACE$\",exported_container=\"$CONTAINER_NAME$\"}[$MEASUREMENT_DURATION_IN_MIN$m]))" + }, + { + "function": "min", + "query": "min by (Hostname,device,modelName,UUID,exported_container,exported_namespace) (min_over_time(DCGM_FI_DEV_MEM_COPY_UTIL{exported_namespace=\"$NAMESPACE$\",exported_container=\"$CONTAINER_NAME$\"}[$MEASUREMENT_DURATION_IN_MIN$m]))" + } + ] } ] } diff --git a/manifests/autotune/performance-profiles/resource_optimization_local_monitoring_norecordingrules.yaml b/manifests/autotune/performance-profiles/resource_optimization_local_monitoring_norecordingrules.yaml index 8a85c70e7..d50d42df1 100644 --- a/manifests/autotune/performance-profiles/resource_optimization_local_monitoring_norecordingrules.yaml +++ b/manifests/autotune/performance-profiles/resource_optimization_local_monitoring_norecordingrules.yaml @@ -210,168 +210,209 @@ slo: - function: max query: 'max by(namespace,container) (last_over_time((timestamp(container_cpu_usage_seconds_total{namespace="$NAMESPACE$", container="$CONTAINER_NAME$"} > 0))[15d:]))' - ## namespace related queries - - # Namespace quota for CPU requests - # Show namespace quota for CPU requests in cores for a namespace - - name: namespaceCpuRequest - datasource: prometheus - value_type: "double" - kubernetes_object: "namespace" - aggregation_functions: - # sum of all cpu request quotas for a namespace in cores - - function: sum - query: 'sum by (namespace) (kube_resourcequota{namespace="$NAMESPACE$", resource="requests.cpu", type="hard"})' - - # Namespace quota for CPU limits - # Show namespace quota for CPU limits in cores for a namespace - - name: namespaceCpuLimit - datasource: prometheus - value_type: "double" - kubernetes_object: "namespace" - aggregation_functions: - # sum of all cpu limits quotas for a namespace in cores - - function: sum - query: 'sum by (namespace) (kube_resourcequota{namespace="$NAMESPACE$", resource="limits.cpu", type="hard"})' - - - # Namespace quota for memory requests - # Show namespace quota for memory requests in bytes for a namespace - - name: namespaceMemoryRequest - datasource: prometheus - value_type: "double" - kubernetes_object: "namespace" - aggregation_functions: - # sum of all memory requests quotas for a namespace in bytes - - function: sum - query: 'sum by (namespace) (kube_resourcequota{namespace="$NAMESPACE$", resource="requests.memory", type="hard"})' - - - # Namespace quota for memory limits - # Show namespace quota for memory limits in bytes for a namespace - - name: namespaceMemoryLimit - datasource: prometheus - value_type: "double" - kubernetes_object: "namespace" - aggregation_functions: - # sum of all memory limits quotas for a namespace in bytes - - function: sum - query: 'sum by (namespace) (kube_resourcequota{namespace="$NAMESPACE$", resource="limits.memory", type="hard"})' - - - # Namespace CPU usage - # Show cpu usages in cores for a namespace - - name: namespaceCpuUsage - datasource: prometheus - value_type: "double" - kubernetes_object: "namespace" - aggregation_functions: - # average cpu usages in cores for a 
namespace - - function: avg - query: 'avg_over_time(sum by(namespace) (rate(container_cpu_usage_seconds_total{namespace="$NAMESPACE$", container!="", container!="POD", pod!=""}[5m]) )[$MEASUREMENT_DURATION_IN_MIN$m:])' + ## namespace related queries - # maximum cpu usages in cores for a namespace - - function: max - query: 'max_over_time(sum by(namespace) (rate(container_cpu_usage_seconds_total{namespace="$NAMESPACE$", container!="", container!="POD", pod!=""}[5m]) )[$MEASUREMENT_DURATION_IN_MIN$m:])' + # Namespace quota for CPU requests + # Show namespace quota for CPU requests in cores for a namespace + - name: namespaceCpuRequest + datasource: prometheus + value_type: "double" + kubernetes_object: "namespace" + aggregation_functions: + # sum of all cpu request quotas for a namespace in cores + - function: sum + query: 'sum by (namespace) (kube_resourcequota{namespace="$NAMESPACE$", resource="requests.cpu", type="hard"})' - # minimum cpu usages in cores for a namespace - - function: min - query: 'min_over_time(sum by(namespace) (rate(container_cpu_usage_seconds_total{namespace="$NAMESPACE$", container!="", container!="POD", pod!=""}[5m]) )[$MEASUREMENT_DURATION_IN_MIN$m:])' + # Namespace quota for CPU limits + # Show namespace quota for CPU limits in cores for a namespace + - name: namespaceCpuLimit + datasource: prometheus + value_type: "double" + kubernetes_object: "namespace" + aggregation_functions: + # sum of all cpu limits quotas for a namespace in cores + - function: sum + query: 'sum by (namespace) (kube_resourcequota{namespace="$NAMESPACE$", resource="limits.cpu", type="hard"})' - # Namespace CPU Throttle - # Show cpu throttle in cores for a namespace - - name: namespaceCpuThrottle - datasource: prometheus - value_type: "double" - kubernetes_object: "namespace" - aggregation_functions: - # average cpu throttle in cores for a namespace - - function: avg - query: 'avg_over_time(sum by(namespace) (rate(container_cpu_cfs_throttled_seconds_total{namespace="$NAMESPACE$", container!="", container!="POD", pod!=""}[5m]))[$MEASUREMENT_DURATION_IN_MIN$m:])' + # Namespace quota for memory requests + # Show namespace quota for memory requests in bytes for a namespace + - name: namespaceMemoryRequest + datasource: prometheus + value_type: "double" + kubernetes_object: "namespace" + aggregation_functions: + # sum of all memory requests quotas for a namespace in bytes + - function: sum + query: 'sum by (namespace) (kube_resourcequota{namespace="$NAMESPACE$", resource="requests.memory", type="hard"})' - # maximum cpu throttle in cores for a namespace - - function: max - query: 'max_over_time(sum by(namespace) (rate(container_cpu_cfs_throttled_seconds_total{namespace="$NAMESPACE$", container!="", container!="POD", pod!=""}[5m]))[$MEASUREMENT_DURATION_IN_MIN$m:])' - # minimum cpu throttle in cores for a namespace - - function: min - query: 'min_over_time(sum by(namespace) (rate(container_cpu_cfs_throttled_seconds_total{namespace="$NAMESPACE$", container!="", container!="POD", pod!=""}[5m]))[$MEASUREMENT_DURATION_IN_MIN$m:])' + # Namespace quota for memory limits + # Show namespace quota for memory limits in bytes for a namespace + - name: namespaceMemoryLimit + datasource: prometheus + value_type: "double" + kubernetes_object: "namespace" + aggregation_functions: + # sum of all memory limits quotas for a namespace in bytes + - function: sum + query: 'sum by (namespace) (kube_resourcequota{namespace="$NAMESPACE$", resource="limits.memory", type="hard"})' - # Namespace memory usage - # Show memory 
usages in bytes for a namespace - - name: namespaceMemoryUsage - datasource: prometheus - value_type: "double" - kubernetes_object: "namespace" - aggregation_functions: - # average memory usage in bytes for a namespace - - function: avg - query: 'avg_over_time(sum by(namespace) (container_memory_working_set_bytes{namespace="$NAMESPACE$", container!="", container!="POD", pod!=""})[$MEASUREMENT_DURATION_IN_MIN$m:])' + # Namespace CPU usage + # Show cpu usages in cores for a namespace + - name: namespaceCpuUsage + datasource: prometheus + value_type: "double" + kubernetes_object: "namespace" + aggregation_functions: + # average cpu usages in cores for a namespace + - function: avg + query: 'avg_over_time(sum by(namespace) (rate(container_cpu_usage_seconds_total{namespace="$NAMESPACE$", container!="", container!="POD", pod!=""}[5m]) )[$MEASUREMENT_DURATION_IN_MIN$m:])' - # maximum memory usage in bytes for a namespace - - function: max - query: 'max_over_time(sum by(namespace) (container_memory_working_set_bytes{namespace="$NAMESPACE$", container!="", container!="POD", pod!=""})[$MEASUREMENT_DURATION_IN_MIN$m:])' + # maximum cpu usages in cores for a namespace + - function: max + query: 'max_over_time(sum by(namespace) (rate(container_cpu_usage_seconds_total{namespace="$NAMESPACE$", container!="", container!="POD", pod!=""}[5m]) )[$MEASUREMENT_DURATION_IN_MIN$m:])' + + # minimum cpu usages in cores for a namespace + - function: min + query: 'min_over_time(sum by(namespace) (rate(container_cpu_usage_seconds_total{namespace="$NAMESPACE$", container!="", container!="POD", pod!=""}[5m]) )[$MEASUREMENT_DURATION_IN_MIN$m:])' + + + # Namespace CPU Throttle + # Show cpu throttle in cores for a namespace + - name: namespaceCpuThrottle + datasource: prometheus + value_type: "double" + kubernetes_object: "namespace" + aggregation_functions: + # average cpu throttle in cores for a namespace + - function: avg + query: 'avg_over_time(sum by(namespace) (rate(container_cpu_cfs_throttled_seconds_total{namespace="$NAMESPACE$", container!="", container!="POD", pod!=""}[5m]))[$MEASUREMENT_DURATION_IN_MIN$m:])' + + # maximum cpu throttle in cores for a namespace + - function: max + query: 'max_over_time(sum by(namespace) (rate(container_cpu_cfs_throttled_seconds_total{namespace="$NAMESPACE$", container!="", container!="POD", pod!=""}[5m]))[$MEASUREMENT_DURATION_IN_MIN$m:])' + + # minimum cpu throttle in cores for a namespace + - function: min + query: 'min_over_time(sum by(namespace) (rate(container_cpu_cfs_throttled_seconds_total{namespace="$NAMESPACE$", container!="", container!="POD", pod!=""}[5m]))[$MEASUREMENT_DURATION_IN_MIN$m:])' + + + # Namespace memory usage + # Show memory usages in bytes for a namespace + - name: namespaceMemoryUsage + datasource: prometheus + value_type: "double" + kubernetes_object: "namespace" + aggregation_functions: + # average memory usage in bytes for a namespace + - function: avg + query: 'avg_over_time(sum by(namespace) (container_memory_working_set_bytes{namespace="$NAMESPACE$", container!="", container!="POD", pod!=""})[$MEASUREMENT_DURATION_IN_MIN$m:])' + + # maximum memory usage in bytes for a namespace + - function: max + query: 'max_over_time(sum by(namespace) (container_memory_working_set_bytes{namespace="$NAMESPACE$", container!="", container!="POD", pod!=""})[$MEASUREMENT_DURATION_IN_MIN$m:])' + + # minimum memory usage in bytes for a namespace + - function: min + query: 'min_over_time(sum by(namespace) (container_memory_working_set_bytes{namespace="$NAMESPACE$", 
container!="", container!="POD", pod!=""})[$MEASUREMENT_DURATION_IN_MIN$m:])' + + + # Namespace memory rss value + # Show memory rss in bytes for a namespace + - name: namespaceMemoryRSS + datasource: prometheus + value_type: "double" + kubernetes_object: "namespace" + aggregation_functions: + # average memory rss in bytes for a namespace + - function: avg + query: 'avg_over_time(sum by(namespace) (container_memory_rss{namespace="$NAMESPACE$", container!="", container!="POD", pod!=""})[$MEASUREMENT_DURATION_IN_MIN$m:])' + + # maximum memory rss in bytes for a namespace + - function: max + query: 'max_over_time(sum by(namespace) (container_memory_rss{namespace="$NAMESPACE$", container!="", container!="POD", pod!=""})[$MEASUREMENT_DURATION_IN_MIN$m:])' + + # minimum memory rss in bytes for a namespace + - function: min + query: 'min_over_time(sum by(namespace) (container_memory_rss{namespace="$NAMESPACE$", container!="", container!="POD", pod!=""})[$MEASUREMENT_DURATION_IN_MIN$m:])' + + + # Show total pods in a namespace + - name: namespaceTotalPods + datasource: prometheus + value_type: "double" + kubernetes_object: "namespace" + aggregation_functions: + # maximum total pods in a namespace + - function: max + query: 'max_over_time(sum by(namespace) ((kube_pod_info{namespace="$NAMESPACE$"}))[$MEASUREMENT_DURATION_IN_MIN$m:])' + # average total pods in a namespace + - function: avg + query: 'avg_over_time(sum by(namespace) ((kube_pod_info{namespace="$NAMESPACE$"}))[$MEASUREMENT_DURATION_IN_MIN$m:])' + + + # Show total running pods in a namespace + - name: namespaceRunningPods + datasource: prometheus + value_type: "double" + kubernetes_object: "namespace" + aggregation_functions: + # maximum total pods in a namespace + - function: max + query: 'max_over_time(sum by(namespace) ((kube_pod_status_phase{phase="Running"}))[$MEASUREMENT_DURATION_IN_MIN$m:])' + # average total pods in a namespace + - function: avg + query: 'avg_over_time(sum by(namespace) ((kube_pod_status_phase{phase="Running"}))[$MEASUREMENT_DURATION_IN_MIN$m:])' + + # Show last activity for a namespace + - name: namespaceMaxDate + datasource: prometheus + value_type: "double" + kubernetes_object: "namespace" + aggregation_functions: + - function: max + query: 'max(last_over_time(timestamp((sum by (namespace) (container_cpu_usage_seconds_total{namespace="$NAMESPACE$"})) > 0 )[15d:]))' - # minimum memory usage in bytes for a namespace - - function: min - query: 'min_over_time(sum by(namespace) (container_memory_working_set_bytes{namespace="$NAMESPACE$", container!="", container!="POD", pod!=""})[$MEASUREMENT_DURATION_IN_MIN$m:])' + # GPU Related metrics + + # GPU Core Usage + - name: gpuCoreUsage + datasource: prometheus + value_type: "double" + kubernetes_object: "container" - # Namespace memory rss value - # Show memory rss in bytes for a namespace - - name: namespaceMemoryRSS - datasource: prometheus - value_type: "double" - kubernetes_object: "namespace" - aggregation_functions: - # average memory rss in bytes for a namespace + aggregation_functions: + # Average GPU Core Usage Percentage per container in a deployment - function: avg - query: 'avg_over_time(sum by(namespace) (container_memory_rss{namespace="$NAMESPACE$", container!="", container!="POD", pod!=""})[$MEASUREMENT_DURATION_IN_MIN$m:])' + query: 'avg by (Hostname,device,modelName,UUID,exported_container,exported_namespace) (avg_over_time(DCGM_FI_DEV_GPU_UTIL{exported_namespace="$NAMESPACE$",exported_container="$CONTAINER_NAME$"}[$MEASUREMENT_DURATION_IN_MIN$m])' - # 
maximum memory rss in bytes for a namespace + # Maximum GPU Core Usage Percentage per container in a deployment - function: max - query: 'max_over_time(sum by(namespace) (container_memory_rss{namespace="$NAMESPACE$", container!="", container!="POD", pod!=""})[$MEASUREMENT_DURATION_IN_MIN$m:])' + query: 'max by (Hostname,device,modelName,UUID,exported_container,exported_namespace) (max_over_time(DCGM_FI_DEV_GPU_UTIL{exported_namespace="$NAMESPACE$",exported_container="$CONTAINER_NAME$"}[$MEASUREMENT_DURATION_IN_MIN$m])' - # minimum memory rss in bytes for a namespace + # Minimum of GPU Core Usage Percentage for a container in a deployment - function: min - query: 'min_over_time(sum by(namespace) (container_memory_rss{namespace="$NAMESPACE$", container!="", container!="POD", pod!=""})[$MEASUREMENT_DURATION_IN_MIN$m:])' + query: 'min by (Hostname,device,modelName,UUID,exported_container,exported_namespace) (min_over_time(DCGM_FI_DEV_GPU_UTIL{exported_namespace="$NAMESPACE$",exported_container="$CONTAINER_NAME$"}[$MEASUREMENT_DURATION_IN_MIN$m])' + # GPU Memory usage + - name: gpuMemoryUsage + datasource: prometheus + value_type: "double" + kubernetes_object: "container" - # Show total pods in a namespace - - name: namespaceTotalPods - datasource: prometheus - value_type: "double" - kubernetes_object: "namespace" - aggregation_functions: - # maximum total pods in a namespace - - function: max - query: 'max_over_time(sum by(namespace) ((kube_pod_info{namespace="$NAMESPACE$"}))[$MEASUREMENT_DURATION_IN_MIN$m:])' - # average total pods in a namespace + aggregation_functions: + # Average GPU Memory Usage Percentage per container in a deployment - function: avg - query: 'avg_over_time(sum by(namespace) ((kube_pod_info{namespace="$NAMESPACE$"}))[$MEASUREMENT_DURATION_IN_MIN$m:])' - + query: 'avg by (Hostname,device,modelName,UUID,exported_container,exported_namespace) (avg_over_time(DCGM_FI_DEV_MEM_COPY_UTIL{exported_namespace="$NAMESPACE$",exported_container="$CONTAINER_NAME$"}[$MEASUREMENT_DURATION_IN_MIN$m])' - # Show total running pods in a namespace - - name: namespaceRunningPods - datasource: prometheus - value_type: "double" - kubernetes_object: "namespace" - aggregation_functions: - # maximum total pods in a namespace - - function: max - query: 'max_over_time(sum by(namespace) ((kube_pod_status_phase{phase="Running"}))[$MEASUREMENT_DURATION_IN_MIN$m:])' - # average total pods in a namespace - - function: avg - query: 'avg_over_time(sum by(namespace) ((kube_pod_status_phase{phase="Running"}))[$MEASUREMENT_DURATION_IN_MIN$m:])' - - # Show last activity for a namespace - - name: namespaceMaxDate - datasource: prometheus - value_type: "double" - kubernetes_object: "namespace" - aggregation_functions: + # Maximum GPU Memory Usage Percentage per container in a deployment - function: max - query: 'max(last_over_time(timestamp((sum by (namespace) (container_cpu_usage_seconds_total{namespace="$NAMESPACE$"})) > 0 )[15d:]))' + query: 'max by (Hostname,device,modelName,UUID,exported_container,exported_namespace) (max_over_time(DCGM_FI_DEV_MEM_COPY_UTIL{exported_namespace="$NAMESPACE$",exported_container="$CONTAINER_NAME$"}[$MEASUREMENT_DURATION_IN_MIN$m])' + + # Minimum of GPU Memory Usage Percentage for a container in a deployment + - function: min + query: 'min by (Hostname,device,modelName,UUID,exported_container,exported_namespace) (min_over_time(DCGM_FI_DEV_MEM_COPY_UTIL{exported_namespace="$NAMESPACE$",exported_container="$CONTAINER_NAME$"}[$MEASUREMENT_DURATION_IN_MIN$m])' diff --git 
diff --git a/manifests/crc/BYODB-installation/minikube/kruize-crc-minikube.yaml b/manifests/crc/BYODB-installation/minikube/kruize-crc-minikube.yaml
index 34e122da5..b556f04f2 100644
--- a/manifests/crc/BYODB-installation/minikube/kruize-crc-minikube.yaml
+++ b/manifests/crc/BYODB-installation/minikube/kruize-crc-minikube.yaml
@@ -35,6 +35,7 @@ data:
         "plots": "true",
         "local": "false",
         "logAllHttpReqAndResp": "true",
+        "recommendationsURL" : "http://kruize.monitoring.svc.cluster.local:8080/generateRecommendations?experiment_name=%s",
         "hibernate": {
           "dialect": "org.hibernate.dialect.PostgreSQLDialect",
           "driver": "org.postgresql.Driver",
diff --git a/manifests/crc/BYODB-installation/openshift/kruize-crc-openshift.yaml b/manifests/crc/BYODB-installation/openshift/kruize-crc-openshift.yaml
index 2528656be..889528b1b 100644
--- a/manifests/crc/BYODB-installation/openshift/kruize-crc-openshift.yaml
+++ b/manifests/crc/BYODB-installation/openshift/kruize-crc-openshift.yaml
@@ -48,6 +48,7 @@ data:
         "plots": "true",
         "local": "false",
         "logAllHttpReqAndResp": "true",
+        "recommendationsURL" : "http://kruize.openshift-tuning.svc.cluster.local:8080/generateRecommendations?experiment_name=%s",
         "hibernate": {
           "dialect": "org.hibernate.dialect.PostgreSQLDialect",
           "driver": "org.postgresql.Driver",
diff --git a/manifests/crc/default-db-included-installation/aks/kruize-crc-aks.yaml b/manifests/crc/default-db-included-installation/aks/kruize-crc-aks.yaml
index 5d7231b80..7d2fd6766 100644
--- a/manifests/crc/default-db-included-installation/aks/kruize-crc-aks.yaml
+++ b/manifests/crc/default-db-included-installation/aks/kruize-crc-aks.yaml
@@ -99,6 +99,7 @@ data:
         "plots": "true",
         "local": "false",
         "logAllHttpReqAndResp": "true",
+        "recommendationsURL" : "http://kruize.monitoring.svc.cluster.local:8080/generateRecommendations?experiment_name=%s",
         "hibernate": {
           "dialect": "org.hibernate.dialect.PostgreSQLDialect",
           "driver": "org.postgresql.Driver",
diff --git a/manifests/crc/default-db-included-installation/minikube/kruize-crc-minikube.yaml b/manifests/crc/default-db-included-installation/minikube/kruize-crc-minikube.yaml
index e8fe59ea8..2366cc669 100644
--- a/manifests/crc/default-db-included-installation/minikube/kruize-crc-minikube.yaml
+++ b/manifests/crc/default-db-included-installation/minikube/kruize-crc-minikube.yaml
@@ -113,6 +113,7 @@ data:
         "plots": "true",
         "local": "false",
         "logAllHttpReqAndResp": "true",
+        "recommendationsURL" : "http://kruize.monitoring.svc.cluster.local:8080/generateRecommendations?experiment_name=%s",
         "hibernate": {
           "dialect": "org.hibernate.dialect.PostgreSQLDialect",
           "driver": "org.postgresql.Driver",
diff --git a/manifests/crc/default-db-included-installation/openshift/kruize-crc-openshift.yaml b/manifests/crc/default-db-included-installation/openshift/kruize-crc-openshift.yaml
index 1f73ae042..bb76b0e0b 100644
--- a/manifests/crc/default-db-included-installation/openshift/kruize-crc-openshift.yaml
+++ b/manifests/crc/default-db-included-installation/openshift/kruize-crc-openshift.yaml
@@ -107,6 +107,7 @@ data:
         "plots": "true",
         "local": "false",
         "logAllHttpReqAndResp": "true",
+        "recommendationsURL" : "http://kruize.openshift-tuning.svc.cluster.local:8080/generateRecommendations?experiment_name=%s",
         "hibernate": {
           "dialect": "org.hibernate.dialect.PostgreSQLDialect",
           "driver": "org.postgresql.Driver",
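Each of the five manifests gains the same `recommendationsURL` key, whose trailing `%s` is a format placeholder for an experiment name. A minimal sketch of how such a URL could be resolved per experiment; the `String.format` call and the URL-encoding step are assumptions, not code from this PR:

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Minimal sketch, not code from this PR: filling the %s placeholder per experiment.
public class ResolveRecommendationsUrl {
    public static void main(String[] args) {
        String recommendationsURL = "http://kruize.monitoring.svc.cluster.local:8080"
                + "/generateRecommendations?experiment_name=%s";
        String experimentName = "prometheus-1|default|monitoring|kruize(deployment)|kruize";
        String target = String.format(recommendationsURL,
                URLEncoder.encode(experimentName, StandardCharsets.UTF_8));
        // target is now a ready-to-call generateRecommendations endpoint for one experiment
        System.out.println(target);
    }
}
```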
diff --git a/src/main/java/com/autotune/analyzer/Analyzer.java b/src/main/java/com/autotune/analyzer/Analyzer.java
index 9ebf49199..0c2cea55b 100644
--- a/src/main/java/com/autotune/analyzer/Analyzer.java
+++ b/src/main/java/com/autotune/analyzer/Analyzer.java
@@ -58,6 +58,7 @@ public static void addServlets(ServletContextHandler context) {
         context.addServlet(MetricProfileService.class, ServerContext.DELETE_METRIC_PROFILE);
         context.addServlet(ListDatasources.class, ServerContext.LIST_DATASOURCES);
         context.addServlet(DSMetadataService.class, ServerContext.DATASOURCE_METADATA);
+        context.addServlet(BulkService.class, ServerContext.BULK_SERVICE);

         // Adding UI support API's
         context.addServlet(ListNamespaces.class, ServerContext.LIST_NAMESPACES);
diff --git a/src/main/java/com/autotune/analyzer/adapters/DeviceDetailsAdapter.java b/src/main/java/com/autotune/analyzer/adapters/DeviceDetailsAdapter.java
new file mode 100644
index 000000000..57ceaf735
--- /dev/null
+++ b/src/main/java/com/autotune/analyzer/adapters/DeviceDetailsAdapter.java
@@ -0,0 +1,84 @@
+package com.autotune.analyzer.adapters;
+
+import com.autotune.analyzer.utils.AnalyzerConstants;
+import com.autotune.common.data.system.info.device.DeviceDetails;
+import com.autotune.common.data.system.info.device.accelerator.AcceleratorDeviceData;
+import com.google.gson.TypeAdapter;
+import com.google.gson.stream.JsonReader;
+import com.google.gson.stream.JsonWriter;
+import java.io.IOException;
+
+
+/**
+ * This adapter tells GSON which concrete implementation of DeviceDetails
+ * to use when serializing or deserializing.
+ */
+public class DeviceDetailsAdapter extends TypeAdapter<DeviceDetails> {
+
+    @Override
+    public void write(JsonWriter out, DeviceDetails value) throws IOException {
+        out.beginObject();
+        out.name("type").value(value.getType().name());
+
+        if (value instanceof AcceleratorDeviceData accelerator) {
+            out.name("manufacturer").value(accelerator.getManufacturer());
+            out.name("modelName").value(accelerator.getModelName());
+            out.name("hostName").value(accelerator.getHostName());
+            out.name("UUID").value(accelerator.getUUID());
+            out.name("deviceName").value(accelerator.getDeviceName());
+            out.name("isMIG").value(accelerator.isMIG());
+        }
+        // Add for other devices when added
+
+        out.endObject();
+    }
+
+    @Override
+    public DeviceDetails read(JsonReader in) throws IOException {
+        String type = null;
+        String manufacturer = null;
+        String modelName = null;
+        String hostName = null;
+        String UUID = null;
+        String deviceName = null;
+        boolean isMIG = false;
+
+        in.beginObject();
+        while (in.hasNext()) {
+            switch (in.nextName()) {
+                case "type":
+                    type = in.nextString();
+                    break;
+                case "manufacturer":
+                    manufacturer = in.nextString();
+                    break;
+                case "modelName":
+                    modelName = in.nextString();
+                    break;
+                case "hostName":
+                    hostName = in.nextString();
+                    break;
+                case "UUID":
+                    UUID = in.nextString();
+                    break;
+                case "deviceName":
+                    deviceName = in.nextString();
+                    break;
+                case "isMIG":
+                    isMIG = in.nextBoolean();
+                    break;
+                default:
+                    in.skipValue();
+            }
+        }
+        in.endObject();
+
+        if (type != null && type.equals(AnalyzerConstants.DeviceType.ACCELERATOR.name())) {
+            return (DeviceDetails) new AcceleratorDeviceData(modelName, hostName, UUID, deviceName, isMIG);
+        }
+        // Add for other device types if implemented in future
+
+        return null;
+    }
+}
+
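A minimal sketch of how the adapter is presumably wired up, by analogy with the `registerTypeAdapter` call this PR adds to `KruizeErrorHandler` below; the sample device values are illustrative:

```java
// Minimal sketch, not code from this PR; assumes the classes above are on the classpath.
import com.autotune.analyzer.adapters.DeviceDetailsAdapter;
import com.autotune.common.data.system.info.device.DeviceDetails;
import com.autotune.common.data.system.info.device.accelerator.AcceleratorDeviceData;
import com.google.gson.Gson;
import com.google.gson.GsonBuilder;

public class DeviceDetailsRoundTrip {
    public static void main(String[] args) {
        Gson gson = new GsonBuilder()
                .registerTypeAdapter(DeviceDetails.class, new DeviceDetailsAdapter())
                .create();
        // Argument order follows the read() method above: modelName, hostName,
        // UUID, deviceName, isMIG; the values themselves are illustrative.
        DeviceDetails gpu = new AcceleratorDeviceData(
                "NVIDIA A100-SXM4-40GB", "node-1", "GPU-0000", "nvidia0", false);
        String json = gson.toJson(gpu, DeviceDetails.class);           // write() adds the "type" field
        DeviceDetails back = gson.fromJson(json, DeviceDetails.class); // read() dispatches on "type"
        System.out.println(json + " -> " + back);
    }
}
```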
diff --git a/src/main/java/com/autotune/analyzer/adapters/RecommendationItemAdapter.java b/src/main/java/com/autotune/analyzer/adapters/RecommendationItemAdapter.java
new file mode 100644
index 000000000..79139fbc4
--- /dev/null
+++ b/src/main/java/com/autotune/analyzer/adapters/RecommendationItemAdapter.java
@@ -0,0 +1,43 @@
+
+package com.autotune.analyzer.adapters;
+
+import com.autotune.analyzer.utils.AnalyzerConstants;
+import com.google.gson.*;
+
+import java.lang.reflect.Type;
+
+/**
+ * Earlier the RecommendationItem enum had only two entries: cpu and memory.
+ * At the time of serialization (storing in the DB or returning as JSON via an API),
+ * Java handled the toString conversion and turned them into the "cpu" and "memory" strings.
+ * They are also the keys in the recommendation (requests & limits).
+ *
+ * But in the case of NVIDIA, the resources have / and . in the string representation of the MIG name,
+ * so we cannot add them as enum names as-is; instead, each entry accepts a string,
+ * and its toString returns that string value.
+ *
+ * At deserialization time the string entries are converted to enum entries, and vice versa during serialization.
+ * For example, for the entry NVIDIA_GPU_PARTITION_1_CORE_5GB("nvidia.com/mig-1g.5gb"), its toString
+ * is nvidia.com/mig-1g.5gb, which does not match the enum name NVIDIA_GPU_PARTITION_1_CORE_5GB.
+ *
+ * Also, to maintain consistency, cpu was changed to CPU, so without this adapter
+ * the JSON would be generated with CPU as the key.
+ */
+public class RecommendationItemAdapter implements JsonSerializer<AnalyzerConstants.RecommendationItem>, JsonDeserializer<AnalyzerConstants.RecommendationItem> {
+    @Override
+    public JsonElement serialize(AnalyzerConstants.RecommendationItem recommendationItem, Type type, JsonSerializationContext jsonSerializationContext) {
+        return jsonSerializationContext.serialize(recommendationItem.toString());
+    }
+
+
+    @Override
+    public AnalyzerConstants.RecommendationItem deserialize(JsonElement jsonElement, Type type, JsonDeserializationContext jsonDeserializationContext) throws JsonParseException {
+        String value = jsonElement.getAsString();
+        for (AnalyzerConstants.RecommendationItem item : AnalyzerConstants.RecommendationItem.values()) {
+            if (item.toString().equals(value)) {
+                return item;
+            }
+        }
+        throw new JsonParseException("Unknown element " + value);
+    }
+}
\ No newline at end of file
diff --git a/src/main/java/com/autotune/analyzer/exceptions/KruizeErrorHandler.java b/src/main/java/com/autotune/analyzer/exceptions/KruizeErrorHandler.java
index 0f7de32a8..1de629485 100644
--- a/src/main/java/com/autotune/analyzer/exceptions/KruizeErrorHandler.java
+++ b/src/main/java/com/autotune/analyzer/exceptions/KruizeErrorHandler.java
@@ -15,7 +15,9 @@
  *******************************************************************************/
 package com.autotune.analyzer.exceptions;
 
+import com.autotune.analyzer.adapters.RecommendationItemAdapter;
 import com.autotune.analyzer.serviceObjects.FailedUpdateResultsAPIObject;
+import com.autotune.analyzer.utils.AnalyzerConstants;
 import com.autotune.analyzer.utils.GsonUTCDateAdapter;
 import com.google.gson.Gson;
 import com.google.gson.GsonBuilder;
@@ -56,6 +58,7 @@ public void handle(String target, Request baseRequest, HttpServletRequest reques
                 .disableHtmlEscaping()
                 .enableComplexMapKeySerialization()
                 .registerTypeAdapter(Date.class, new GsonUTCDateAdapter())
+                .registerTypeAdapter(AnalyzerConstants.RecommendationItem.class, new RecommendationItemAdapter())
                 .create();
         String gsonStr = gsonObj.toJson(new KruizeResponse(origMessage, errorCode, "", "ERROR", myList));
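A minimal sketch of the adapter's effect; the `CPU("cpu")` and `NVIDIA_GPU_PARTITION_1_CORE_5GB("nvidia.com/mig-1g.5gb")` entries are assumed from the javadoc above:

```java
// Minimal sketch, not code from this PR; the enum entries are assumed from the javadoc above.
import com.autotune.analyzer.adapters.RecommendationItemAdapter;
import com.autotune.analyzer.utils.AnalyzerConstants;
import com.google.gson.Gson;
import com.google.gson.GsonBuilder;

public class RecommendationItemRoundTrip {
    public static void main(String[] args) {
        Gson gson = new GsonBuilder()
                .enableComplexMapKeySerialization()
                .registerTypeAdapter(AnalyzerConstants.RecommendationItem.class,
                        new RecommendationItemAdapter())
                .create();
        // CPU serializes as its toString() value "cpu", not its enum name "CPU",
        String cpu = gson.toJson(AnalyzerConstants.RecommendationItem.CPU);
        // and a MIG resource string deserializes back to its enum entry.
        AnalyzerConstants.RecommendationItem mig = gson.fromJson(
                "\"nvidia.com/mig-1g.5gb\"", AnalyzerConstants.RecommendationItem.class);
        System.out.println(cpu + " / " + mig.name());
    }
}
```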
diff --git a/src/main/java/com/autotune/analyzer/kruizeObject/CreateExperimentConfigBean.java b/src/main/java/com/autotune/analyzer/kruizeObject/CreateExperimentConfigBean.java
new file mode 100644
index 000000000..5303441f6
--- /dev/null
+++ b/src/main/java/com/autotune/analyzer/kruizeObject/CreateExperimentConfigBean.java
@@ -0,0 +1,111 @@
+/*******************************************************************************
+ * Copyright (c) 2022, 2022 Red Hat, IBM Corporation and others.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ *       http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ *******************************************************************************/
+package com.autotune.analyzer.kruizeObject;
+
+/**
+ * This is a placeholder class for the bulk API createExperiment template; it stores the defaults.
+ */
+public class CreateExperimentConfigBean {
+
+    // Private fields
+    private String mode;
+    private String target;
+    private String version;
+    private String datasourceName;
+    private String performanceProfile;
+    private double threshold;
+    private String measurementDurationStr;
+    private int measurementDuration;
+
+    // Getters and Setters
+    public String getMode() {
+        return mode;
+    }
+
+    public void setMode(String mode) {
+        this.mode = mode;
+    }
+
+    public String getTarget() {
+        return target;
+    }
+
+    public void setTarget(String target) {
+        this.target = target;
+    }
+
+    public String getVersion() {
+        return version;
+    }
+
+    public void setVersion(String version) {
+        this.version = version;
+    }
+
+    public String getDatasourceName() {
+        return datasourceName;
+    }
+
+    public void setDatasourceName(String datasourceName) {
+        this.datasourceName = datasourceName;
+    }
+
+    public String getPerformanceProfile() {
+        return performanceProfile;
+    }
+
+    public void setPerformanceProfile(String performanceProfile) {
+        this.performanceProfile = performanceProfile;
+    }
+
+    public double getThreshold() {
+        return threshold;
+    }
+
+    public void setThreshold(double threshold) {
+        this.threshold = threshold;
+    }
+
+    public String getMeasurementDurationStr() {
+        return measurementDurationStr;
+    }
+
+    public void setMeasurementDurationStr(String measurementDurationStr) {
+        this.measurementDurationStr = measurementDurationStr;
+    }
+
+    public int getMeasurementDuration() {
+        return measurementDuration;
+    }
+
+    public void setMeasurementDuration(int measurementDuration) {
+        this.measurementDuration = measurementDuration;
+    }
+
+    @Override
+    public String toString() {
+        return "CreateExperimentConfigBean{" +
+                "mode='" + mode + '\'' +
+                ", target='" + target + '\'' +
+                ", version='" + version + '\'' +
+                ", datasourceName='" + datasourceName + '\'' +
+                ", performanceProfile='" + performanceProfile + '\'' +
+                ", threshold=" + threshold +
+                ", measurementDurationStr='" + measurementDurationStr + '\'' +
+                ", measurementDuration=" + measurementDuration +
+                '}';
+    }
+}
diff --git a/src/main/java/com/autotune/analyzer/recommendations/RecommendationConstants.java b/src/main/java/com/autotune/analyzer/recommendations/RecommendationConstants.java
index 4cc4be488..d708331e9 100644
--- a/src/main/java/com/autotune/analyzer/recommendations/RecommendationConstants.java
+++ b/src/main/java/com/autotune/analyzer/recommendations/RecommendationConstants.java
@@ -738,6 +738,8 @@ public static class
PercentileConstants { public static final Integer TWENTYFIVE_PERCENTILE = 25; public static final Integer SEVENTYFIVE_PERCENTILE = 75; public static final Integer FIFTY_PERCENTILE = 50; + public static final Integer COST_ACCELERATOR_PERCENTILE = 60; + public static final Integer PERFORMANCE_ACCELERATOR_PERCENTILE = 98; } } } diff --git a/src/main/java/com/autotune/analyzer/recommendations/engine/RecommendationEngine.java b/src/main/java/com/autotune/analyzer/recommendations/engine/RecommendationEngine.java index bb9a202be..86ea2ebe1 100644 --- a/src/main/java/com/autotune/analyzer/recommendations/engine/RecommendationEngine.java +++ b/src/main/java/com/autotune/analyzer/recommendations/engine/RecommendationEngine.java @@ -17,18 +17,14 @@ import com.autotune.analyzer.recommendations.utils.RecommendationUtils; import com.autotune.analyzer.utils.AnalyzerConstants; import com.autotune.analyzer.utils.AnalyzerErrorConstants; -import com.autotune.analyzer.utils.ExperimentTypeUtil; import com.autotune.common.data.ValidationOutputData; -import com.autotune.common.data.metrics.AggregationFunctions; -import com.autotune.common.data.metrics.Metric; -import com.autotune.common.data.metrics.MetricAggregationInfoResults; -import com.autotune.common.data.metrics.MetricResults; +import com.autotune.common.data.metrics.*; import com.autotune.common.data.result.ContainerData; import com.autotune.common.data.result.IntervalResults; import com.autotune.common.data.result.NamespaceData; +import com.autotune.common.data.system.info.device.DeviceDetails; +import com.autotune.common.data.system.info.device.accelerator.AcceleratorDeviceData; import com.autotune.common.datasource.DataSourceInfo; -import com.autotune.common.auth.AuthenticationStrategy; -import com.autotune.common.auth.AuthenticationStrategyFactory; import com.autotune.common.exceptions.DataSourceNotExist; import com.autotune.common.k8sObjects.K8sObject; import com.autotune.common.utils.CommonUtils; @@ -435,12 +431,12 @@ RecommendationConfigItem>> getCurrentConfigData(ContainerData containerData, Tim if (null == configItem) continue; if (null == configItem.getAmount()) { - if (recommendationItem.equals(AnalyzerConstants.RecommendationItem.cpu)) { + if (recommendationItem.equals(AnalyzerConstants.RecommendationItem.CPU)) { notifications.add(RecommendationConstants.RecommendationNotification.ERROR_AMOUNT_MISSING_IN_CPU_SECTION); LOGGER.error(RecommendationConstants.RecommendationNotificationMsgConstant.AMOUNT_MISSING_IN_CPU_SECTION .concat(String.format(AnalyzerErrorConstants.AutotuneObjectErrors.EXPERIMENT_AND_INTERVAL_END_TIME, experimentName, interval_end_time))); - } else if (recommendationItem.equals((AnalyzerConstants.RecommendationItem.memory))) { + } else if (recommendationItem.equals((AnalyzerConstants.RecommendationItem.MEMORY))) { notifications.add(RecommendationConstants.RecommendationNotification.ERROR_AMOUNT_MISSING_IN_MEMORY_SECTION); LOGGER.error(RecommendationConstants.RecommendationNotificationMsgConstant.AMOUNT_MISSING_IN_MEMORY_SECTION .concat(String.format(AnalyzerErrorConstants.AutotuneObjectErrors.EXPERIMENT_AND_INTERVAL_END_TIME, @@ -449,12 +445,12 @@ RecommendationConfigItem>> getCurrentConfigData(ContainerData containerData, Tim continue; } if (null == configItem.getFormat()) { - if (recommendationItem.equals(AnalyzerConstants.RecommendationItem.cpu)) { + if (recommendationItem.equals(AnalyzerConstants.RecommendationItem.CPU)) { 
notifications.add(RecommendationConstants.RecommendationNotification.ERROR_FORMAT_MISSING_IN_CPU_SECTION); LOGGER.error(RecommendationConstants.RecommendationNotificationMsgConstant.FORMAT_MISSING_IN_CPU_SECTION .concat(String.format(AnalyzerErrorConstants.AutotuneObjectErrors.EXPERIMENT_AND_INTERVAL_END_TIME, experimentName, interval_end_time))); - } else if (recommendationItem.equals((AnalyzerConstants.RecommendationItem.memory))) { + } else if (recommendationItem.equals((AnalyzerConstants.RecommendationItem.MEMORY))) { notifications.add(RecommendationConstants.RecommendationNotification.ERROR_FORMAT_MISSING_IN_MEMORY_SECTION); LOGGER.error(RecommendationConstants.RecommendationNotificationMsgConstant.FORMAT_MISSING_IN_MEMORY_SECTION .concat(String.format(AnalyzerErrorConstants.AutotuneObjectErrors.EXPERIMENT_AND_INTERVAL_END_TIME, @@ -463,12 +459,12 @@ RecommendationConfigItem>> getCurrentConfigData(ContainerData containerData, Tim continue; } if (configItem.getAmount() <= 0.0) { - if (recommendationItem.equals(AnalyzerConstants.RecommendationItem.cpu)) { + if (recommendationItem.equals(AnalyzerConstants.RecommendationItem.CPU)) { notifications.add(RecommendationConstants.RecommendationNotification.ERROR_INVALID_AMOUNT_IN_CPU_SECTION); LOGGER.error(RecommendationConstants.RecommendationNotificationMsgConstant.INVALID_AMOUNT_IN_CPU_SECTION .concat(String.format(AnalyzerErrorConstants.AutotuneObjectErrors.EXPERIMENT_AND_INTERVAL_END_TIME, experimentName, interval_end_time))); - } else if (recommendationItem.equals((AnalyzerConstants.RecommendationItem.memory))) { + } else if (recommendationItem.equals((AnalyzerConstants.RecommendationItem.MEMORY))) { notifications.add(RecommendationConstants.RecommendationNotification.ERROR_INVALID_AMOUNT_IN_MEMORY_SECTION); LOGGER.error(RecommendationConstants.RecommendationNotificationMsgConstant.INVALID_AMOUNT_IN_MEMORY_SECTION .concat(String.format(AnalyzerErrorConstants.AutotuneObjectErrors.EXPERIMENT_AND_INTERVAL_END_TIME, @@ -477,12 +473,12 @@ RecommendationConfigItem>> getCurrentConfigData(ContainerData containerData, Tim continue; } if (configItem.getFormat().isEmpty() || configItem.getFormat().isBlank()) { - if (recommendationItem.equals(AnalyzerConstants.RecommendationItem.cpu)) { + if (recommendationItem.equals(AnalyzerConstants.RecommendationItem.CPU)) { notifications.add(RecommendationConstants.RecommendationNotification.ERROR_INVALID_FORMAT_IN_CPU_SECTION); LOGGER.error(RecommendationConstants.RecommendationNotificationMsgConstant.INVALID_FORMAT_IN_CPU_SECTION .concat(String.format(AnalyzerErrorConstants.AutotuneObjectErrors.EXPERIMENT_AND_INTERVAL_END_TIME, experimentName, interval_end_time))); - } else if (recommendationItem.equals((AnalyzerConstants.RecommendationItem.memory))) { + } else if (recommendationItem.equals((AnalyzerConstants.RecommendationItem.MEMORY))) { notifications.add(RecommendationConstants.RecommendationNotification.ERROR_INVALID_FORMAT_IN_MEMORY_SECTION); LOGGER.error(RecommendationConstants.RecommendationNotificationMsgConstant.INVALID_FORMAT_IN_MEMORY_SECTION .concat(String.format(AnalyzerErrorConstants.AutotuneObjectErrors.EXPERIMENT_AND_INTERVAL_END_TIME, @@ -668,20 +664,20 @@ private MappedRecommendationForModel generateRecommendationBasedOnModel(Timestam if (currentConfigMap.containsKey(AnalyzerConstants.ResourceSetting.requests) && null != currentConfigMap.get(AnalyzerConstants.ResourceSetting.requests)) { HashMap requestsMap = currentConfigMap.get(AnalyzerConstants.ResourceSetting.requests); - if 
(requestsMap.containsKey(AnalyzerConstants.RecommendationItem.cpu) && null != requestsMap.get(AnalyzerConstants.RecommendationItem.cpu)) { - currentCPURequest = requestsMap.get(AnalyzerConstants.RecommendationItem.cpu); + if (requestsMap.containsKey(AnalyzerConstants.RecommendationItem.CPU) && null != requestsMap.get(AnalyzerConstants.RecommendationItem.CPU)) { + currentCPURequest = requestsMap.get(AnalyzerConstants.RecommendationItem.CPU); } - if (requestsMap.containsKey(AnalyzerConstants.RecommendationItem.memory) && null != requestsMap.get(AnalyzerConstants.RecommendationItem.memory)) { - currentMemRequest = requestsMap.get(AnalyzerConstants.RecommendationItem.memory); + if (requestsMap.containsKey(AnalyzerConstants.RecommendationItem.MEMORY) && null != requestsMap.get(AnalyzerConstants.RecommendationItem.MEMORY)) { + currentMemRequest = requestsMap.get(AnalyzerConstants.RecommendationItem.MEMORY); } } if (currentConfigMap.containsKey(AnalyzerConstants.ResourceSetting.limits) && null != currentConfigMap.get(AnalyzerConstants.ResourceSetting.limits)) { HashMap limitsMap = currentConfigMap.get(AnalyzerConstants.ResourceSetting.limits); - if (limitsMap.containsKey(AnalyzerConstants.RecommendationItem.cpu) && null != limitsMap.get(AnalyzerConstants.RecommendationItem.cpu)) { - currentCPULimit = limitsMap.get(AnalyzerConstants.RecommendationItem.cpu); + if (limitsMap.containsKey(AnalyzerConstants.RecommendationItem.CPU) && null != limitsMap.get(AnalyzerConstants.RecommendationItem.CPU)) { + currentCPULimit = limitsMap.get(AnalyzerConstants.RecommendationItem.CPU); } - if (limitsMap.containsKey(AnalyzerConstants.RecommendationItem.memory) && null != limitsMap.get(AnalyzerConstants.RecommendationItem.memory)) { - currentMemLimit = limitsMap.get(AnalyzerConstants.RecommendationItem.memory); + if (limitsMap.containsKey(AnalyzerConstants.RecommendationItem.MEMORY) && null != limitsMap.get(AnalyzerConstants.RecommendationItem.MEMORY)) { + currentMemLimit = limitsMap.get(AnalyzerConstants.RecommendationItem.MEMORY); } } if (null != monitoringStartTime) { @@ -702,6 +698,7 @@ private MappedRecommendationForModel generateRecommendationBasedOnModel(Timestam // Get the Recommendation Items RecommendationConfigItem recommendationCpuRequest = model.getCPURequestRecommendation(filteredResultsMap, notifications); RecommendationConfigItem recommendationMemRequest = model.getMemoryRequestRecommendation(filteredResultsMap, notifications); + Map recommendationAcceleratorRequestMap = model.getAcceleratorRequestRecommendation(filteredResultsMap, notifications); // Get the Recommendation Items // Calling requests on limits as we are maintaining limits and requests as same @@ -732,7 +729,8 @@ private MappedRecommendationForModel generateRecommendationBasedOnModel(Timestam internalMapToPopulate, numPods, cpuThreshold, - memoryThreshold + memoryThreshold, + recommendationAcceleratorRequestMap ); } else { RecommendationNotification notification = new RecommendationNotification( @@ -826,40 +824,40 @@ private HashMap requestsMap = currentNamespaceConfigMap.get(AnalyzerConstants.ResourceSetting.requests); - if (requestsMap.containsKey(AnalyzerConstants.RecommendationItem.cpu) && null != requestsMap.get(AnalyzerConstants.RecommendationItem.cpu)) { - currentNamespaceCPURequest = requestsMap.get(AnalyzerConstants.RecommendationItem.cpu); + if (requestsMap.containsKey(AnalyzerConstants.RecommendationItem.CPU) && null != requestsMap.get(AnalyzerConstants.RecommendationItem.CPU)) { + currentNamespaceCPURequest = 
requestsMap.get(AnalyzerConstants.RecommendationItem.CPU); } - if (requestsMap.containsKey(AnalyzerConstants.RecommendationItem.memory) && null != requestsMap.get(AnalyzerConstants.RecommendationItem.memory)) { - currentNamespaceMemRequest = requestsMap.get(AnalyzerConstants.RecommendationItem.memory); + if (requestsMap.containsKey(AnalyzerConstants.RecommendationItem.MEMORY) && null != requestsMap.get(AnalyzerConstants.RecommendationItem.MEMORY)) { + currentNamespaceMemRequest = requestsMap.get(AnalyzerConstants.RecommendationItem.MEMORY); } } if (currentNamespaceConfigMap.containsKey(AnalyzerConstants.ResourceSetting.limits) && null != currentNamespaceConfigMap.get(AnalyzerConstants.ResourceSetting.limits)) { HashMap limitsMap = currentNamespaceConfigMap.get(AnalyzerConstants.ResourceSetting.limits); - if (limitsMap.containsKey(AnalyzerConstants.RecommendationItem.cpu) && null != limitsMap.get(AnalyzerConstants.RecommendationItem.cpu)) { - currentNamespaceCPULimit = limitsMap.get(AnalyzerConstants.RecommendationItem.cpu); + if (limitsMap.containsKey(AnalyzerConstants.RecommendationItem.CPU) && null != limitsMap.get(AnalyzerConstants.RecommendationItem.CPU)) { + currentNamespaceCPULimit = limitsMap.get(AnalyzerConstants.RecommendationItem.CPU); } - if (limitsMap.containsKey(AnalyzerConstants.RecommendationItem.memory) && null != limitsMap.get(AnalyzerConstants.RecommendationItem.memory)) { - currentNamespaceMemLimit = limitsMap.get(AnalyzerConstants.RecommendationItem.memory); + if (limitsMap.containsKey(AnalyzerConstants.RecommendationItem.MEMORY) && null != limitsMap.get(AnalyzerConstants.RecommendationItem.MEMORY)) { + currentNamespaceMemLimit = limitsMap.get(AnalyzerConstants.RecommendationItem.MEMORY); } } if (null != monitoringStartTime) { @@ -1081,7 +1079,8 @@ private MappedRecommendationForModel generateNamespaceRecommendationBasedOnModel internalMapToPopulate, numPodsInNamespace, namespaceCpuThreshold, - namespaceMemoryThreshold + namespaceMemoryThreshold, + null ); } else { RecommendationNotification notification = new RecommendationNotification( @@ -1104,13 +1103,17 @@ private MappedRecommendationForModel generateNamespaceRecommendationBasedOnModel * @param numPods The number of pods to consider for the recommendation. * @param cpuThreshold The CPU usage threshold for the recommendation. * @param memoryThreshold The memory usage threshold for the recommendation. + * @param recommendationAcceleratorRequestMap The Map which has Accelerator recommendations * @return {@code true} if the internal map was successfully populated; {@code false} otherwise. */ private boolean populateRecommendation(Map.Entry termEntry, MappedRecommendationForModel recommendationModel, ArrayList notifications, HashMap internalMapToPopulate, - int numPods, double cpuThreshold, double memoryThreshold) { + int numPods, + double cpuThreshold, + double memoryThreshold, + Map recommendationAcceleratorRequestMap) { // Check for cpu & memory Thresholds (Duplicate check if the caller is generateRecommendations) String recommendationTerm = termEntry.getKey(); double hours = termEntry.getValue().getDays() * KruizeConstants.TimeConv.NO_OF_HOURS_PER_DAY * KruizeConstants.TimeConv. 
@@ -1273,7 +1276,7 @@ private boolean populateRecommendation(Map.Entry termEntry, generatedCpuRequestFormat = recommendationCpuRequest.getFormat(); if (null != generatedCpuRequestFormat && !generatedCpuRequestFormat.isEmpty()) { isRecommendedCPURequestAvailable = true; - requestsMap.put(AnalyzerConstants.RecommendationItem.cpu, recommendationCpuRequest); + requestsMap.put(AnalyzerConstants.RecommendationItem.CPU, recommendationCpuRequest); } else { RecommendationNotification recommendationNotification = new RecommendationNotification(RecommendationConstants.RecommendationNotification.ERROR_FORMAT_MISSING_IN_CPU_SECTION); notifications.add(recommendationNotification); @@ -1289,7 +1292,7 @@ private boolean populateRecommendation(Map.Entry termEntry, generatedMemRequestFormat = recommendationMemRequest.getFormat(); if (null != generatedMemRequestFormat && !generatedMemRequestFormat.isEmpty()) { isRecommendedMemoryRequestAvailable = true; - requestsMap.put(AnalyzerConstants.RecommendationItem.memory, recommendationMemRequest); + requestsMap.put(AnalyzerConstants.RecommendationItem.MEMORY, recommendationMemRequest); } else { RecommendationNotification recommendationNotification = new RecommendationNotification(RecommendationConstants.RecommendationNotification.ERROR_FORMAT_MISSING_IN_MEMORY_SECTION); notifications.add(recommendationNotification); @@ -1325,7 +1328,7 @@ private boolean populateRecommendation(Map.Entry termEntry, generatedCpuLimitFormat = recommendationCpuLimits.getFormat(); if (null != generatedCpuLimitFormat && !generatedCpuLimitFormat.isEmpty()) { isRecommendedCPULimitAvailable = true; - limitsMap.put(AnalyzerConstants.RecommendationItem.cpu, recommendationCpuLimits); + limitsMap.put(AnalyzerConstants.RecommendationItem.CPU, recommendationCpuLimits); } else { RecommendationNotification recommendationNotification = new RecommendationNotification(RecommendationConstants.RecommendationNotification.ERROR_FORMAT_MISSING_IN_CPU_SECTION); notifications.add(recommendationNotification); @@ -1341,7 +1344,7 @@ private boolean populateRecommendation(Map.Entry termEntry, generatedMemLimitFormat = recommendationMemLimits.getFormat(); if (null != generatedMemLimitFormat && !generatedMemLimitFormat.isEmpty()) { isRecommendedMemoryLimitAvailable = true; - limitsMap.put(AnalyzerConstants.RecommendationItem.memory, recommendationMemLimits); + limitsMap.put(AnalyzerConstants.RecommendationItem.MEMORY, recommendationMemLimits); } else { RecommendationNotification recommendationNotification = new RecommendationNotification(RecommendationConstants.RecommendationNotification.ERROR_FORMAT_MISSING_IN_MEMORY_SECTION); notifications.add(recommendationNotification); @@ -1373,7 +1376,7 @@ private boolean populateRecommendation(Map.Entry termEntry, experimentName, interval_end_time))); } else { isCurrentCPURequestAvailable = true; - currentRequestsMap.put(AnalyzerConstants.RecommendationItem.cpu, currentCpuRequest); + currentRequestsMap.put(AnalyzerConstants.RecommendationItem.CPU, currentCpuRequest); } } @@ -1393,7 +1396,7 @@ private boolean populateRecommendation(Map.Entry termEntry, experimentName, interval_end_time))); } else { isCurrentMemoryRequestAvailable = true; - currentRequestsMap.put(AnalyzerConstants.RecommendationItem.memory, currentMemRequest); + currentRequestsMap.put(AnalyzerConstants.RecommendationItem.MEMORY, currentMemRequest); } } @@ -1416,7 +1419,7 @@ private boolean populateRecommendation(Map.Entry termEntry, experimentName, interval_end_time))); } else { isCurrentCPULimitAvailable = 
true; - currentLimitsMap.put(AnalyzerConstants.RecommendationItem.cpu, currentCpuLimit); + currentLimitsMap.put(AnalyzerConstants.RecommendationItem.CPU, currentCpuLimit); } } @@ -1436,7 +1439,7 @@ private boolean populateRecommendation(Map.Entry termEntry, experimentName, interval_end_time))); } else { isCurrentMemoryLimitAvailable = true; - currentLimitsMap.put(AnalyzerConstants.RecommendationItem.memory, currentMemLimit); + currentLimitsMap.put(AnalyzerConstants.RecommendationItem.MEMORY, currentMemLimit); } } @@ -1454,7 +1457,7 @@ private boolean populateRecommendation(Map.Entry termEntry, // TODO: If difference is positive it can be considered as under-provisioning, Need to handle it better isVariationCPURequestAvailable = true; variationCpuRequest = new RecommendationConfigItem(diff, generatedCpuRequestFormat); - requestsVariationMap.put(AnalyzerConstants.RecommendationItem.cpu, variationCpuRequest); + requestsVariationMap.put(AnalyzerConstants.RecommendationItem.CPU, variationCpuRequest); } double currentMemRequestValue = 0.0; @@ -1466,7 +1469,7 @@ private boolean populateRecommendation(Map.Entry termEntry, // TODO: If difference is positive it can be considered as under-provisioning, Need to handle it better isVariationMemoryRequestAvailable = true; variationMemRequest = new RecommendationConfigItem(diff, generatedMemRequestFormat); - requestsVariationMap.put(AnalyzerConstants.RecommendationItem.memory, variationMemRequest); + requestsVariationMap.put(AnalyzerConstants.RecommendationItem.MEMORY, variationMemRequest); } // Create a new map for storing variation in limits @@ -1483,7 +1486,7 @@ private boolean populateRecommendation(Map.Entry termEntry, double diff = generatedCpuLimit - currentCpuLimitValue; isVariationCPULimitAvailable = true; variationCpuLimit = new RecommendationConfigItem(diff, generatedCpuLimitFormat); - limitsVariationMap.put(AnalyzerConstants.RecommendationItem.cpu, variationCpuLimit); + limitsVariationMap.put(AnalyzerConstants.RecommendationItem.CPU, variationCpuLimit); } double currentMemLimitValue = 0.0; @@ -1494,7 +1497,7 @@ private boolean populateRecommendation(Map.Entry termEntry, double diff = generatedMemLimit - currentMemLimitValue; isVariationMemoryLimitAvailable = true; variationMemLimit = new RecommendationConfigItem(diff, generatedMemLimitFormat); - limitsVariationMap.put(AnalyzerConstants.RecommendationItem.memory, variationMemLimit); + limitsVariationMap.put(AnalyzerConstants.RecommendationItem.MEMORY, variationMemLimit); } // build the engine level notifications here @@ -1535,23 +1538,23 @@ private boolean populateRecommendation(Map.Entry termEntry, // Alternative - CPU REQUEST VALUE // Accessing existing recommendation item - RecommendationConfigItem tempAccessedRecCPURequest = requestsMap.get(AnalyzerConstants.RecommendationItem.cpu); + RecommendationConfigItem tempAccessedRecCPURequest = requestsMap.get(AnalyzerConstants.RecommendationItem.CPU); if (null != tempAccessedRecCPURequest) { // Updating it with desired value tempAccessedRecCPURequest.setAmount(currentCpuRequestValue); } // Replace the updated object (Step not needed as we are updating existing object, but just to make sure it's updated) - requestsMap.put(AnalyzerConstants.RecommendationItem.cpu, tempAccessedRecCPURequest); + requestsMap.put(AnalyzerConstants.RecommendationItem.CPU, tempAccessedRecCPURequest); // Alternative - CPU REQUEST VARIATION VALUE // Accessing existing recommendation item - RecommendationConfigItem tempAccessedRecCPURequestVariation = 
requestsVariationMap.get(AnalyzerConstants.RecommendationItem.cpu); + RecommendationConfigItem tempAccessedRecCPURequestVariation = requestsVariationMap.get(AnalyzerConstants.RecommendationItem.CPU); if (null != tempAccessedRecCPURequestVariation) { // Updating it with desired value (as we are setting to current variation would be 0) tempAccessedRecCPURequestVariation.setAmount(CPU_ZERO); } // Replace the updated object (Step not needed as we are updating existing object, but just to make sure it's updated) - requestsVariationMap.put(AnalyzerConstants.RecommendationItem.cpu, tempAccessedRecCPURequestVariation); + requestsVariationMap.put(AnalyzerConstants.RecommendationItem.CPU, tempAccessedRecCPURequestVariation); RecommendationNotification recommendationNotification = new RecommendationNotification(RecommendationConstants.RecommendationNotification.NOTICE_CPU_REQUESTS_OPTIMISED); engineNotifications.add(recommendationNotification); @@ -1575,23 +1578,23 @@ private boolean populateRecommendation(Map.Entry termEntry, // Alternative - CPU LIMIT VALUE // Accessing existing recommendation item - RecommendationConfigItem tempAccessedRecCPULimit = limitsMap.get(AnalyzerConstants.RecommendationItem.cpu); + RecommendationConfigItem tempAccessedRecCPULimit = limitsMap.get(AnalyzerConstants.RecommendationItem.CPU); if (null != tempAccessedRecCPULimit) { // Updating it with desired value tempAccessedRecCPULimit.setAmount(currentCpuLimitValue); } // Replace the updated object (Step not needed as we are updating existing object, but just to make sure it's updated) - limitsMap.put(AnalyzerConstants.RecommendationItem.cpu, tempAccessedRecCPULimit); + limitsMap.put(AnalyzerConstants.RecommendationItem.CPU, tempAccessedRecCPULimit); // Alternative - CPU LIMIT VARIATION VALUE // Accessing existing recommendation item - RecommendationConfigItem tempAccessedRecCPULimitVariation = limitsVariationMap.get(AnalyzerConstants.RecommendationItem.cpu); + RecommendationConfigItem tempAccessedRecCPULimitVariation = limitsVariationMap.get(AnalyzerConstants.RecommendationItem.CPU); if (null != tempAccessedRecCPULimitVariation) { // Updating it with desired value (as we are setting to current variation would be 0) tempAccessedRecCPULimitVariation.setAmount(CPU_ZERO); } // Replace the updated object (Step not needed as we are updating existing object, but just to make sure it's updated) - limitsVariationMap.put(AnalyzerConstants.RecommendationItem.cpu, tempAccessedRecCPULimitVariation); + limitsVariationMap.put(AnalyzerConstants.RecommendationItem.CPU, tempAccessedRecCPULimitVariation); RecommendationNotification recommendationNotification = new RecommendationNotification(RecommendationConstants.RecommendationNotification.NOTICE_CPU_LIMITS_OPTIMISED); engineNotifications.add(recommendationNotification); @@ -1615,23 +1618,23 @@ private boolean populateRecommendation(Map.Entry termEntry, // Alternative - MEMORY REQUEST VALUE // Accessing existing recommendation item - RecommendationConfigItem tempAccessedRecMemoryRequest = requestsMap.get(AnalyzerConstants.RecommendationItem.memory); + RecommendationConfigItem tempAccessedRecMemoryRequest = requestsMap.get(AnalyzerConstants.RecommendationItem.MEMORY); if (null != tempAccessedRecMemoryRequest) { // Updating it with desired value tempAccessedRecMemoryRequest.setAmount(currentMemRequestValue); } // Replace the updated object (Step not needed as we are updating existing object, but just to make sure it's updated) - requestsMap.put(AnalyzerConstants.RecommendationItem.memory, 
tempAccessedRecMemoryRequest); + requestsMap.put(AnalyzerConstants.RecommendationItem.MEMORY, tempAccessedRecMemoryRequest); // Alternative - MEMORY REQUEST VARIATION VALUE // Accessing existing recommendation item - RecommendationConfigItem tempAccessedRecMemoryRequestVariation = requestsVariationMap.get(AnalyzerConstants.RecommendationItem.memory); + RecommendationConfigItem tempAccessedRecMemoryRequestVariation = requestsVariationMap.get(AnalyzerConstants.RecommendationItem.MEMORY); if (null != tempAccessedRecMemoryRequestVariation) { // Updating it with desired value (as we are setting to current variation would be 0) tempAccessedRecMemoryRequestVariation.setAmount(MEM_ZERO); } // Replace the updated object (Step not needed as we are updating existing object, but just to make sure it's updated) - requestsVariationMap.put(AnalyzerConstants.RecommendationItem.memory, tempAccessedRecMemoryRequestVariation); + requestsVariationMap.put(AnalyzerConstants.RecommendationItem.MEMORY, tempAccessedRecMemoryRequestVariation); RecommendationNotification recommendationNotification = new RecommendationNotification(RecommendationConstants.RecommendationNotification.NOTICE_MEMORY_REQUESTS_OPTIMISED); engineNotifications.add(recommendationNotification); @@ -1655,23 +1658,23 @@ private boolean populateRecommendation(Map.Entry termEntry, // Alternative - MEMORY LIMIT VALUE // Accessing existing recommendation item - RecommendationConfigItem tempAccessedRecMemoryLimit = limitsMap.get(AnalyzerConstants.RecommendationItem.memory); + RecommendationConfigItem tempAccessedRecMemoryLimit = limitsMap.get(AnalyzerConstants.RecommendationItem.MEMORY); if (null != tempAccessedRecMemoryLimit) { // Updating it with desired value tempAccessedRecMemoryLimit.setAmount(currentMemLimitValue); } // Replace the updated object (Step not needed as we are updating existing object, but just to make sure it's updated) - limitsMap.put(AnalyzerConstants.RecommendationItem.memory, tempAccessedRecMemoryLimit); + limitsMap.put(AnalyzerConstants.RecommendationItem.MEMORY, tempAccessedRecMemoryLimit); // Alternative - MEMORY LIMIT VARIATION VALUE // Accessing existing recommendation item - RecommendationConfigItem tempAccessedRecMemoryLimitVariation = limitsVariationMap.get(AnalyzerConstants.RecommendationItem.memory); + RecommendationConfigItem tempAccessedRecMemoryLimitVariation = limitsVariationMap.get(AnalyzerConstants.RecommendationItem.MEMORY); if (null != tempAccessedRecMemoryLimitVariation) { // Updating it with desired value (as we are setting to current variation would be 0) tempAccessedRecMemoryLimitVariation.setAmount(MEM_ZERO); } // Replace the updated object (Step not needed as we are updating existing object, but just to make sure it's updated) - limitsVariationMap.put(AnalyzerConstants.RecommendationItem.memory, tempAccessedRecMemoryLimitVariation); + limitsVariationMap.put(AnalyzerConstants.RecommendationItem.MEMORY, tempAccessedRecMemoryLimitVariation); RecommendationNotification recommendationNotification = new RecommendationNotification(RecommendationConstants.RecommendationNotification.NOTICE_MEMORY_LIMITS_OPTIMISED); engineNotifications.add(recommendationNotification); @@ -1694,6 +1697,11 @@ private boolean populateRecommendation(Map.Entry termEntry, config.put(AnalyzerConstants.ResourceSetting.requests, requestsMap); } + // Check if accelerator map is not empty and add to limits map + if (null != recommendationAcceleratorRequestMap && !recommendationAcceleratorRequestMap.isEmpty()) { + 
limitsMap.putAll(recommendationAcceleratorRequestMap); + } + // Set Limits Map if (!limitsMap.isEmpty()) { config.put(AnalyzerConstants.ResourceSetting.limits, limitsMap); @@ -1808,9 +1816,17 @@ public void fetchMetricsBasedOnProfileAndDatasource(KruizeObject kruizeObject, T } String maxDateQuery = null; + String acceleratorDetectionQuery = null; if (kruizeObject.isContainerExperiment()) { maxDateQuery = getMaxDateQuery(metricProfile, AnalyzerConstants.MetricName.maxDate.name()); - fetchContainerMetricsBasedOnDataSourceAndProfile(kruizeObject, interval_end_time, interval_start_time, dataSourceInfo, metricProfile, maxDateQuery); + acceleratorDetectionQuery = getMaxDateQuery(metricProfile, AnalyzerConstants.MetricName.gpuMemoryUsage.name()); + fetchContainerMetricsBasedOnDataSourceAndProfile(kruizeObject, + interval_end_time, + interval_start_time, + dataSourceInfo, + metricProfile, + maxDateQuery, + acceleratorDetectionQuery); } else if (kruizeObject.isNamespaceExperiment()) { maxDateQuery = getMaxDateQuery(metricProfile, AnalyzerConstants.MetricName.namespaceMaxDate.name()); fetchNamespaceMetricsBasedOnDataSourceAndProfile(kruizeObject, interval_end_time, interval_start_time, dataSourceInfo, metricProfile, maxDateQuery); @@ -1897,9 +1913,8 @@ private void fetchNamespaceMetricsBasedOnDataSourceAndProfile(KruizeObject kruiz k8sObject.setNamespaceData(namespaceData); } - List namespaceMetricList = metricProfile.getSloInfo().getFunctionVariables().stream() - .filter(metricEntry -> metricEntry.getName().startsWith(AnalyzerConstants.NAMESPACE) && !metricEntry.getName().equals("namespaceMaxDate")) - .toList(); + List namespaceMetricList = filterMetricsBasedOnExpTypeAndK8sObject(metricProfile, + AnalyzerConstants.MetricName.namespaceMaxDate.name(), kruizeObject.getExperimentType()); // Iterate over metrics and aggregation functions for (Metric metricEntry : namespaceMetricList) { @@ -1978,7 +1993,7 @@ private void fetchNamespaceMetricsBasedOnDataSourceAndProfile(KruizeObject kruiz /** - * Fetches namespace metrics based on the specified datasource using queries from the metricProfile for the given time interval. + * Fetches Container metrics based on the specified datasource using queries from the metricProfile for the given time interval. 
* * @param kruizeObject KruizeObject * @param interval_end_time The end time of the interval in the format yyyy-MM-ddTHH:mm:sssZ @@ -1988,7 +2003,13 @@ private void fetchNamespaceMetricsBasedOnDataSourceAndProfile(KruizeObject kruiz * @param maxDateQuery max date query for containers * @throws Exception */ - private void fetchContainerMetricsBasedOnDataSourceAndProfile(KruizeObject kruizeObject, Timestamp interval_end_time, Timestamp interval_start_time, DataSourceInfo dataSourceInfo, PerformanceProfile metricProfile, String maxDateQuery) throws Exception, FetchMetricsError { + private void fetchContainerMetricsBasedOnDataSourceAndProfile(KruizeObject kruizeObject, + Timestamp interval_end_time, + Timestamp interval_start_time, + DataSourceInfo dataSourceInfo, + PerformanceProfile metricProfile, + String maxDateQuery, + String acceleratorDetectionQuery) throws Exception, FetchMetricsError { try { long interval_end_time_epoc = 0; long interval_start_time_epoc = 0; @@ -2007,6 +2028,20 @@ private void fetchContainerMetricsBasedOnDataSourceAndProfile(KruizeObject kruiz for (Map.Entry entry : containerDataMap.entrySet()) { ContainerData containerData = entry.getValue(); + + // Check if the container data has Accelerator support else check for Accelerator metrics + if (null == containerData.getContainerDeviceList() || !containerData.getContainerDeviceList().isAcceleratorDeviceDetected()) { + RecommendationUtils.markAcceleratorDeviceStatusToContainer(containerData, + maxDateQuery, + namespace, + workload, + workload_type, + dataSourceInfo, + kruizeObject.getTerms(), + measurementDurationMinutesInDouble, + acceleratorDetectionQuery); + } + String containerName = containerData.getContainer_name(); if (null == interval_end_time) { LOGGER.info(KruizeConstants.APIMessages.CONTAINER_USAGE_INFO); @@ -2058,20 +2093,47 @@ private void fetchContainerMetricsBasedOnDataSourceAndProfile(KruizeObject kruiz HashMap containerDataResults = new HashMap<>(); IntervalResults intervalResults = null; HashMap resMap = null; - HashMap resultMap = null; + HashMap acceleratorMetricResultHashMap; MetricResults metricResults = null; MetricAggregationInfoResults metricAggregationInfoResults = null; - List metricList = metricProfile.getSloInfo().getFunctionVariables(); + List metricList = filterMetricsBasedOnExpTypeAndK8sObject(metricProfile, + AnalyzerConstants.MetricName.maxDate.name(), kruizeObject.getExperimentType()); + List acceleratorFunctions = Arrays.asList( + AnalyzerConstants.MetricName.gpuCoreUsage.toString(), + AnalyzerConstants.MetricName.gpuMemoryUsage.toString() + ); // Iterate over metrics and aggregation functions for (Metric metricEntry : metricList) { + + boolean isAcceleratorMetric = false; + boolean fetchAcceleratorMetrics = false; + + if (acceleratorFunctions.contains(metricEntry.getName())) { + isAcceleratorMetric = true; + } + + if (isAcceleratorMetric + && null != containerData.getContainerDeviceList() + && containerData.getContainerDeviceList().isAcceleratorDeviceDetected()) { + fetchAcceleratorMetrics = true; + } + + // Skip fetching Accelerator metrics if the workload doesn't use Accelerator + if (isAcceleratorMetric && !fetchAcceleratorMetrics) + continue; + HashMap aggregationFunctions = metricEntry.getAggregationFunctionsMap(); for (Map.Entry aggregationFunctionsEntry: aggregationFunctions.entrySet()) { // Determine promQL query on metric type String promQL = aggregationFunctionsEntry.getValue().getQuery(); - String format = null; + // Skipping if the promQL is empty + if (null == promQL || 
promQL.isEmpty()) + continue; + + String format = null; // Determine format based on metric type - Todo move this metric profile List cpuFunction = Arrays.asList(AnalyzerConstants.MetricName.cpuUsage.toString(), AnalyzerConstants.MetricName.cpuThrottle.toString(), AnalyzerConstants.MetricName.cpuLimit.toString(), AnalyzerConstants.MetricName.cpuRequest.toString()); @@ -2080,8 +2142,11 @@ private void fetchContainerMetricsBasedOnDataSourceAndProfile(KruizeObject kruiz format = KruizeConstants.JSONKeys.CORES; } else if (memFunction.contains(metricEntry.getName())) { format = KruizeConstants.JSONKeys.BYTES; + } else if (isAcceleratorMetric) { + format = KruizeConstants.JSONKeys.CORES; } + // If promQL is determined, fetch metrics from the datasource promQL = promQL .replace(AnalyzerConstants.NAMESPACE_VARIABLE, namespace) .replace(AnalyzerConstants.CONTAINER_VARIABLE, containerName) @@ -2089,48 +2154,150 @@ private void fetchContainerMetricsBasedOnDataSourceAndProfile(KruizeObject kruiz .replace(AnalyzerConstants.WORKLOAD_VARIABLE, workload) .replace(AnalyzerConstants.WORKLOAD_TYPE_VARIABLE, workload_type); - // If promQL is determined, fetch metrics from the datasource - if (promQL != null) { - LOGGER.info(promQL); - String podMetricsUrl; - try { - podMetricsUrl = String.format(KruizeConstants.DataSourceConstants.DATASOURCE_ENDPOINT_WITH_QUERY, - dataSourceInfo.getUrl(), - URLEncoder.encode(promQL, CHARACTER_ENCODING), - interval_start_time_epoc, - interval_end_time_epoc, - measurementDurationMinutesInDouble.intValue() * KruizeConstants.TimeConv.NO_OF_SECONDS_PER_MINUTE); - LOGGER.info(podMetricsUrl); - client.setBaseURL(podMetricsUrl); - JSONObject genericJsonObject = client.fetchMetricsJson(KruizeConstants.APIMessages.GET, ""); - JsonObject jsonObject = new Gson().fromJson(genericJsonObject.toString(), JsonObject.class); - JsonArray resultArray = jsonObject.getAsJsonObject(KruizeConstants.JSONKeys.DATA).getAsJsonArray(KruizeConstants.DataSourceConstants.DataSourceQueryJSONKeys.RESULT); - // Process fetched metrics - if (null != resultArray && !resultArray.isEmpty()) { - resultArray = jsonObject.getAsJsonObject(KruizeConstants.JSONKeys.DATA).getAsJsonArray( - KruizeConstants.DataSourceConstants.DataSourceQueryJSONKeys.RESULT).get(0) - .getAsJsonObject().getAsJsonArray(KruizeConstants.DataSourceConstants - .DataSourceQueryJSONKeys.VALUES); - sdf.setTimeZone(TimeZone.getTimeZone(KruizeConstants.TimeUnitsExt.TimeZones.UTC)); + LOGGER.info(promQL); + String podMetricsUrl; + try { + podMetricsUrl = String.format(KruizeConstants.DataSourceConstants.DATASOURCE_ENDPOINT_WITH_QUERY, + dataSourceInfo.getUrl(), + URLEncoder.encode(promQL, CHARACTER_ENCODING), + interval_start_time_epoc, + interval_end_time_epoc, + measurementDurationMinutesInDouble.intValue() * KruizeConstants.TimeConv.NO_OF_SECONDS_PER_MINUTE); + LOGGER.info(podMetricsUrl); + client.setBaseURL(podMetricsUrl); + JSONObject genericJsonObject = client.fetchMetricsJson(KruizeConstants.APIMessages.GET, ""); + JsonObject jsonObject = new Gson().fromJson(genericJsonObject.toString(), JsonObject.class); + JsonArray resultArray = jsonObject.getAsJsonObject(KruizeConstants.JSONKeys.DATA).getAsJsonArray(KruizeConstants.DataSourceConstants.DataSourceQueryJSONKeys.RESULT); + // Skipping if Result array is null or empty + if (null == resultArray || resultArray.isEmpty()) + continue; + + // Process fetched metrics + if (isAcceleratorMetric){ + for (JsonElement result : resultArray) { + JsonObject resultObject = result.getAsJsonObject(); + 
JsonObject metricObject = resultObject.getAsJsonObject(KruizeConstants.JSONKeys.METRIC); + + // Set the data only for the container Accelerator device + if (null == metricObject.get(KruizeConstants.JSONKeys.MODEL_NAME).getAsString()) + continue; + if (metricObject.get(KruizeConstants.JSONKeys.MODEL_NAME).getAsString().isEmpty()) + continue; + + ArrayList<DeviceDetails> deviceDetails = containerData.getContainerDeviceList().getDevices(AnalyzerConstants.DeviceType.ACCELERATOR); + // Continuing to next element + // All other elements will also fail as there is no Accelerator attached + // Theoretically, it doesn't fail, but future implementations may change + // So adding a check after a function call to check its return value is advisable + // TODO: Needs a check to figure out why the device list is empty if isAcceleratorDeviceDetected is true + if (null == deviceDetails) + continue; + if (deviceDetails.isEmpty()) + continue; + + // Assuming only one MIG supported Accelerator is attached + // Needs to be changed when you support multiple Accelerators + // Same changes need to be applied at the time of adding the device in + // DeviceHandler + DeviceDetails deviceDetail = deviceDetails.get(0); + AcceleratorDeviceData containerAcceleratorDeviceData = (AcceleratorDeviceData) deviceDetail; + + // Skip non-matching Accelerator entries + if (!metricObject.get(KruizeConstants.JSONKeys.MODEL_NAME).getAsString().equalsIgnoreCase(containerAcceleratorDeviceData.getModelName())) + continue; + + AcceleratorDeviceData acceleratorDeviceData = new AcceleratorDeviceData(metricObject.get(KruizeConstants.JSONKeys.MODEL_NAME).getAsString(), + metricObject.get(KruizeConstants.JSONKeys.HOSTNAME).getAsString(), + metricObject.get(KruizeConstants.JSONKeys.UUID).getAsString(), + metricObject.get(KruizeConstants.JSONKeys.DEVICE).getAsString(), + true); + + JsonArray valuesArray = resultObject.getAsJsonArray(KruizeConstants.DataSourceConstants + .DataSourceQueryJSONKeys.VALUES); + sdf.setTimeZone(TimeZone.getTimeZone(KruizeConstants.TimeUnitsExt.TimeZones.UTC)); // Iterate over fetched metrics Timestamp sTime = new Timestamp(interval_start_time_epoc); - for (JsonElement element : resultArray) { + for (JsonElement element : valuesArray) { JsonArray valueArray = element.getAsJsonArray(); long epochTime = valueArray.get(0).getAsLong(); double value = valueArray.get(1).getAsDouble(); String timestamp = sdf.format(new Date(epochTime * KruizeConstants.TimeConv.NO_OF_MSECS_IN_SEC)); Date date = sdf.parse(timestamp); - Timestamp eTime = new Timestamp(date.getTime()); + Timestamp tempTime = new Timestamp(date.getTime()); + Timestamp eTime = RecommendationUtils.getNearestTimestamp(containerDataResults, + tempTime, + AnalyzerConstants.AcceleratorConstants.AcceleratorMetricConstants.TIMESTAMP_RANGE_CHECK_IN_MINUTES); + + // containerDataResults is empty, so we would use the prometheus timestamp + if (null == eTime) { + // eTime = tempTime; + // Skipping the entry, as inconsistency with the CPU & memory records may cause a null pointer while accessing metric results + // TODO: Need to separate the data records of CPU and memory based on exporter + // TODO: Perform recommendation generation by stitching the outcome + continue; + } // Prepare interval results + if (containerDataResults.containsKey(eTime)) { + intervalResults = containerDataResults.get(eTime); + 
acceleratorMetricResultHashMap = intervalResults.getAcceleratorMetricResultHashMap(); + if (null == acceleratorMetricResultHashMap) + acceleratorMetricResultHashMap = new HashMap<>(); + } else { + intervalResults = new IntervalResults(); + acceleratorMetricResultHashMap = new HashMap<>(); + } + AnalyzerConstants.MetricName metricName = AnalyzerConstants.MetricName.valueOf(metricEntry.getName()); + if (acceleratorMetricResultHashMap.containsKey(metricName)) { + metricResults = acceleratorMetricResultHashMap.get(metricName).getMetricResults(); + metricAggregationInfoResults = metricResults.getAggregationInfoResult(); + } else { + metricResults = new MetricResults(); + metricAggregationInfoResults = new MetricAggregationInfoResults(); + } + Method method = MetricAggregationInfoResults.class.getDeclaredMethod(KruizeConstants.APIMessages.SET + aggregationFunctionsEntry.getKey().substring(0, 1).toUpperCase() + aggregationFunctionsEntry.getKey().substring(1), Double.class); + method.invoke(metricAggregationInfoResults, value); + metricAggregationInfoResults.setFormat(format); + metricResults.setAggregationInfoResult(metricAggregationInfoResults); + metricResults.setName(String.valueOf(metricName)); + metricResults.setFormat(format); + AcceleratorMetricResult acceleratorMetricResult = new AcceleratorMetricResult(acceleratorDeviceData, metricResults); + acceleratorMetricResultHashMap.put(metricName, acceleratorMetricResult); + intervalResults.setAcceleratorMetricResultHashMap(acceleratorMetricResultHashMap); + intervalResults.setIntervalStartTime(sTime); //Todo this will change + intervalResults.setIntervalEndTime(eTime); + intervalResults.setDurationInMinutes((double) ((eTime.getTime() - sTime.getTime()) + / ((long) KruizeConstants.TimeConv.NO_OF_SECONDS_PER_MINUTE + * KruizeConstants.TimeConv.NO_OF_MSECS_IN_SEC))); + containerDataResults.put(eTime, intervalResults); + sTime = eTime; } } - } catch (Exception e) { - throw new RuntimeException(e); + } else { + resultArray = jsonObject.getAsJsonObject(KruizeConstants.JSONKeys.DATA).getAsJsonArray( + KruizeConstants.DataSourceConstants.DataSourceQueryJSONKeys.RESULT).get(0) + .getAsJsonObject().getAsJsonArray(KruizeConstants.DataSourceConstants + .DataSourceQueryJSONKeys.VALUES); + sdf.setTimeZone(TimeZone.getTimeZone(KruizeConstants.TimeUnitsExt.TimeZones.UTC)); + + // Iterate over fetched metrics + Timestamp sTime = new Timestamp(interval_start_time_epoc); + for (JsonElement element : resultArray) { + JsonArray valueArray = element.getAsJsonArray(); + long epochTime = valueArray.get(0).getAsLong(); + double value = valueArray.get(1).getAsDouble(); + String timestamp = sdf.format(new Date(epochTime * KruizeConstants.TimeConv.NO_OF_MSECS_IN_SEC)); + Date date = sdf.parse(timestamp); + Timestamp eTime = new Timestamp(date.getTime()); + + // Prepare interval results + prepareIntervalResults(containerDataResults, intervalResults, resMap, metricResults, + metricAggregationInfoResults, sTime, eTime, metricEntry, aggregationFunctionsEntry, value, format); + } } + } catch (Exception e) { + throw new RuntimeException(e); } } } @@ -2206,5 +2373,28 @@ private void prepareIntervalResults(Map dataResultsM throw new Exception(AnalyzerErrorConstants.APIErrors.UpdateRecommendationsAPI.METRIC_EXCEPTION + e.getMessage()); } } + + /** + * Filters out maxDateQuery and includes metrics based on the experiment type and kubernetes_object + * @param metricProfile Metric profile to be used + * @param maxDateQuery maxDateQuery metric to be filtered out + * @param 
diff --git a/src/main/java/com/autotune/analyzer/recommendations/model/CostBasedRecommendationModel.java b/src/main/java/com/autotune/analyzer/recommendations/model/CostBasedRecommendationModel.java
index db8c783ae..891168f4f 100644
--- a/src/main/java/com/autotune/analyzer/recommendations/model/CostBasedRecommendationModel.java
+++ b/src/main/java/com/autotune/analyzer/recommendations/model/CostBasedRecommendationModel.java
@@ -3,16 +3,21 @@
 import com.autotune.analyzer.recommendations.RecommendationConfigItem;
 import com.autotune.analyzer.recommendations.RecommendationConstants;
 import com.autotune.analyzer.recommendations.RecommendationNotification;
+import com.autotune.analyzer.recommendations.utils.RecommendationUtils;
 import com.autotune.analyzer.utils.AnalyzerConstants;
+import com.autotune.common.data.metrics.AcceleratorMetricResult;
 import com.autotune.common.data.metrics.MetricAggregationInfoResults;
 import com.autotune.common.data.metrics.MetricResults;
 import com.autotune.common.data.result.IntervalResults;
+import com.autotune.common.data.system.info.device.accelerator.metadata.AcceleratorMetaDataService;
+import com.autotune.common.data.system.info.device.accelerator.metadata.AcceleratorProfile;
 import com.autotune.common.utils.CommonUtils;
 import com.autotune.utils.KruizeConstants;
 import org.json.JSONArray;
 import org.json.JSONObject;
 import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;
+import software.amazon.awssdk.services.cloudwatchlogs.endpoints.internal.Value;

 import java.sql.Timestamp;
 import java.util.*;
@@ -22,6 +27,8 @@
 import static com.autotune.analyzer.recommendations.RecommendationConstants.RecommendationEngine.PercentileConstants.COST_CPU_PERCENTILE;
 import static com.autotune.analyzer.recommendations.RecommendationConstants.RecommendationEngine.PercentileConstants.COST_MEMORY_PERCENTILE;
+import static com.autotune.analyzer.recommendations.RecommendationConstants.RecommendationEngine.PercentileConstants.COST_ACCELERATOR_PERCENTILE;
+
 import static com.autotune.analyzer.recommendations.RecommendationConstants.RecommendationValueConstants.*;

 public class CostBasedRecommendationModel implements RecommendationModel {
@@ -505,6 +512,80 @@ public RecommendationConfigItem getMemoryRequestRecommendationForNamespace(Map

+    public Map<AnalyzerConstants.RecommendationItem, RecommendationConfigItem> getAcceleratorRequestRecommendation(
+            Map<Timestamp, IntervalResults> filteredResultsMap,
+            ArrayList<RecommendationNotification> notifications
+    ) {
+        List<Double> acceleratorCoreMaxValues = new ArrayList<>();
+        List<Double> acceleratorMemoryMaxValues = new ArrayList<>();
+
+        boolean isGpuWorkload = false;
+        String acceleratorModel = null;
+
+        for (Map.Entry<Timestamp, IntervalResults> entry : filteredResultsMap.entrySet()) {
+            IntervalResults intervalResults = entry.getValue();
+
+            // Skip if the accelerator map is null
+            if (null == intervalResults.getAcceleratorMetricResultHashMap())
+                continue;
+
+            isGpuWorkload = true;
+            for (Map.Entry<AnalyzerConstants.MetricName, AcceleratorMetricResult> gpuEntry : intervalResults.getAcceleratorMetricResultHashMap().entrySet()) {
+                AcceleratorMetricResult gpuMetricResult = gpuEntry.getValue();
+
+                // Set the Accelerator name
+                // TODO: Need to handle separate processing in case of a container supporting multiple accelerators
+                if (null == acceleratorModel
+                        && null != gpuMetricResult.getAcceleratorDeviceData().getModelName()
+                        && !gpuMetricResult.getAcceleratorDeviceData().getModelName().isEmpty()
+                        && RecommendationUtils.checkIfModelIsKruizeSupportedMIG(gpuMetricResult.getAcceleratorDeviceData().getModelName())
+                ) {
+                    String obtainedAcceleratorName = RecommendationUtils.getSupportedModelBasedOnModelName(gpuMetricResult.getAcceleratorDeviceData().getModelName());
+                    if (null != obtainedAcceleratorName)
+                        acceleratorModel = obtainedAcceleratorName;
+                }
+
+                MetricResults metricResults = gpuMetricResult.getMetricResults();
+
+                // Skip if metric results are null
+                if (null == metricResults || null == metricResults.getAggregationInfoResult())
+                    continue;
+
+                MetricAggregationInfoResults aggregationInfo = metricResults.getAggregationInfoResult();
+
+                // Skip if max is null, zero or negative
+                if (null == aggregationInfo.getMax() || aggregationInfo.getMax() <= 0.0)
+                    continue;
+
+                boolean isCoreUsage = gpuEntry.getKey() == AnalyzerConstants.MetricName.gpuCoreUsage;
+                boolean isMemoryUsage = gpuEntry.getKey() == AnalyzerConstants.MetricName.gpuMemoryUsage;
+
+                // Skip if it's none of the Accelerator metrics
+                if (!isCoreUsage && !isMemoryUsage)
+                    continue;
+
+                if (isCoreUsage) {
+                    acceleratorCoreMaxValues.add(aggregationInfo.getMax());
+                } else {
+                    acceleratorMemoryMaxValues.add(aggregationInfo.getMax());
+                }
+            }
+        }
+
+        if (!isGpuWorkload) {
+            return null;
+        }
+
+        double coreAverage = CommonUtils.percentile(COST_ACCELERATOR_PERCENTILE, acceleratorCoreMaxValues);
+        double memoryAverage = CommonUtils.percentile(COST_ACCELERATOR_PERCENTILE, acceleratorMemoryMaxValues);
+
+        double coreFraction = coreAverage / 100;
+        double memoryFraction = memoryAverage / 100;
+
+        return RecommendationUtils.getMapWithOptimalProfile(acceleratorModel, coreFraction, memoryFraction);
+    }
+
     public static JSONObject calculateNamespaceMemoryUsage(IntervalResults intervalResults) {
         // create a JSON object which should be returned here having two values, Math.max and Collections.Min
         JSONObject jsonObject = new JSONObject();
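Both the cost and performance models size the accelerator the same way: collect the per-interval max usage (a percentage of the full GPU), take a high percentile, and convert it into a fraction of a GPU that the chosen MIG profile must cover. A minimal sketch of that arithmetic, with a naive nearest-rank percentile standing in for `CommonUtils.percentile()` (the names here are illustrative, not Kruize APIs):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class AcceleratorSizingSketch {

    // Simple nearest-rank percentile; a stand-in for CommonUtils.percentile()
    static double percentile(double p, List<Double> values) {
        if (values.isEmpty()) return 0.0;
        List<Double> sorted = new ArrayList<>(values);
        Collections.sort(sorted);
        int rank = (int) Math.ceil((p / 100.0) * sorted.size());
        return sorted.get(Math.max(0, rank - 1));
    }

    public static void main(String[] args) {
        // Max GPU core utilisation (%) observed in each interval
        List<Double> coreMaxValues = List.of(12.0, 35.5, 28.0, 40.2, 22.1);
        double coreP95 = percentile(95, coreMaxValues);
        double coreFraction = coreP95 / 100; // fraction of a full GPU needed
        System.out.printf("p95 core usage: %.1f%% -> fraction %.2f%n", coreP95, coreFraction);
        // A fraction of ~0.40 would need a MIG slice covering at least 3 of the
        // 7 compute slices on an A100, e.g. a 3g.20gb profile.
    }
}
```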
diff --git a/src/main/java/com/autotune/analyzer/recommendations/model/PerformanceBasedRecommendationModel.java b/src/main/java/com/autotune/analyzer/recommendations/model/PerformanceBasedRecommendationModel.java
index fcaccd344..0cd9eee41 100644
--- a/src/main/java/com/autotune/analyzer/recommendations/model/PerformanceBasedRecommendationModel.java
+++ b/src/main/java/com/autotune/analyzer/recommendations/model/PerformanceBasedRecommendationModel.java
@@ -3,8 +3,10 @@
 import com.autotune.analyzer.recommendations.RecommendationConfigItem;
 import com.autotune.analyzer.recommendations.RecommendationConstants;
 import com.autotune.analyzer.recommendations.RecommendationNotification;
+import com.autotune.analyzer.recommendations.utils.RecommendationUtils;
 import com.autotune.analyzer.services.UpdateRecommendations;
 import com.autotune.analyzer.utils.AnalyzerConstants;
+import com.autotune.common.data.metrics.AcceleratorMetricResult;
 import com.autotune.common.data.metrics.MetricAggregationInfoResults;
 import com.autotune.common.data.metrics.MetricResults;
 import com.autotune.common.data.result.IntervalResults;
@@ -19,8 +21,8 @@
 import java.util.*;
 import java.util.stream.Collectors;

-import static com.autotune.analyzer.recommendations.RecommendationConstants.RecommendationEngine.PercentileConstants.PERFORMANCE_CPU_PERCENTILE;
-import static com.autotune.analyzer.recommendations.RecommendationConstants.RecommendationEngine.PercentileConstants.PERFORMANCE_MEMORY_PERCENTILE;
+import static com.autotune.analyzer.recommendations.RecommendationConstants.RecommendationEngine.PercentileConstants.*;
+import static com.autotune.analyzer.recommendations.RecommendationConstants.RecommendationEngine.PercentileConstants.PERFORMANCE_ACCELERATOR_PERCENTILE;
 import static com.autotune.analyzer.recommendations.RecommendationConstants.RecommendationValueConstants.*;

 public class PerformanceBasedRecommendationModel implements RecommendationModel {
@@ -372,6 +374,77 @@ public RecommendationConfigItem getMemoryRequestRecommendationForNamespace(Map

+    public Map<AnalyzerConstants.RecommendationItem, RecommendationConfigItem> getAcceleratorRequestRecommendation(Map<Timestamp, IntervalResults> filteredResultsMap, ArrayList<RecommendationNotification> notifications) {
+        List<Double> acceleratorCoreMaxValues = new ArrayList<>();
+        List<Double> acceleratorMemoryMaxValues = new ArrayList<>();
+
+        boolean isGpuWorkload = false;
+        String acceleratorModel = null;
+
+        for (Map.Entry<Timestamp, IntervalResults> entry : filteredResultsMap.entrySet()) {
+            IntervalResults intervalResults = entry.getValue();
+
+            // Skip if the accelerator map is null
+            if (null == intervalResults.getAcceleratorMetricResultHashMap())
+                continue;
+
+            isGpuWorkload = true;
+            for (Map.Entry<AnalyzerConstants.MetricName, AcceleratorMetricResult> gpuEntry : intervalResults.getAcceleratorMetricResultHashMap().entrySet()) {
+                AcceleratorMetricResult gpuMetricResult = gpuEntry.getValue();
+
+                // Set the Accelerator name
+                if (null == acceleratorModel
+                        && null != gpuMetricResult.getAcceleratorDeviceData().getModelName()
+                        && !gpuMetricResult.getAcceleratorDeviceData().getModelName().isEmpty()
+                        && RecommendationUtils.checkIfModelIsKruizeSupportedMIG(gpuMetricResult.getAcceleratorDeviceData().getModelName())
+                ) {
+                    String obtainedAcceleratorName = RecommendationUtils.getSupportedModelBasedOnModelName(gpuMetricResult.getAcceleratorDeviceData().getModelName());
+
+                    if (null != obtainedAcceleratorName)
+                        acceleratorModel = obtainedAcceleratorName;
+                }
+
+                MetricResults metricResults = gpuMetricResult.getMetricResults();
+
+                // Skip if metric results are null
+                if (null == metricResults || null == metricResults.getAggregationInfoResult())
+                    continue;
+
+                MetricAggregationInfoResults aggregationInfo = metricResults.getAggregationInfoResult();
+
+                // Skip if max is null, zero or negative
+                if (null == aggregationInfo.getMax() || aggregationInfo.getMax() <= 0.0)
+                    continue;
+
+                boolean isCoreUsage = gpuEntry.getKey() == AnalyzerConstants.MetricName.gpuCoreUsage;
+                boolean isMemoryUsage = gpuEntry.getKey() == AnalyzerConstants.MetricName.gpuMemoryUsage;
+
+                // Skip if it's none of the Accelerator metrics
+                if (!isCoreUsage && !isMemoryUsage)
+                    continue;
+
+                if (isCoreUsage) {
+                    acceleratorCoreMaxValues.add(aggregationInfo.getMax());
+                } else {
+                    acceleratorMemoryMaxValues.add(aggregationInfo.getMax());
+                }
+            }
+        }
+
+        if (!isGpuWorkload) {
+            return null;
+        }
+
+        double coreAverage = CommonUtils.percentile(PERFORMANCE_ACCELERATOR_PERCENTILE, acceleratorCoreMaxValues);
+        double memoryAverage = CommonUtils.percentile(PERFORMANCE_ACCELERATOR_PERCENTILE, acceleratorMemoryMaxValues);
+
+        double coreFraction = coreAverage / 100;
+        double memoryFraction = memoryAverage / 100;
+
+        return RecommendationUtils.getMapWithOptimalProfile(acceleratorModel, coreFraction, memoryFraction);
+    }
+
     @Override
     public String getModelName() {
         return this.name;
diff --git a/src/main/java/com/autotune/analyzer/recommendations/model/RecommendationModel.java b/src/main/java/com/autotune/analyzer/recommendations/model/RecommendationModel.java
index 5a905805b..923ac0d20 100644
--- a/src/main/java/com/autotune/analyzer/recommendations/model/RecommendationModel.java
+++ b/src/main/java/com/autotune/analyzer/recommendations/model/RecommendationModel.java
@@ -2,6 +2,7 @@
 import com.autotune.analyzer.recommendations.RecommendationConfigItem;
 import com.autotune.analyzer.recommendations.RecommendationNotification;
+import com.autotune.analyzer.utils.AnalyzerConstants;
 import com.autotune.common.data.result.IntervalResults;

 import java.sql.Timestamp;
@@ -17,6 +18,8 @@
     // get namespace recommendations for Memory Request
     RecommendationConfigItem getMemoryRequestRecommendationForNamespace(Map<Timestamp, IntervalResults> filteredResultsMap, ArrayList<RecommendationNotification> notifications);

+    Map<AnalyzerConstants.RecommendationItem, RecommendationConfigItem> getAcceleratorRequestRecommendation(Map<Timestamp, IntervalResults> filteredResultsMap, ArrayList<RecommendationNotification> notifications);
+
     public String getModelName();

     void validate();
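Note the contract the models implement: a `null` return means "no accelerator metrics were seen, not a GPU workload", while a non-null map contributes extra resource items alongside the usual cpu/memory config. A caller-side sketch of that convention, using plain string keys and doubles as simplified stand-ins for the Kruize types:

```java
import java.util.HashMap;
import java.util.Map;

public class AcceleratorMergeSketch {
    public static void main(String[] args) {
        // Stand-ins for RecommendationItem -> RecommendationConfigItem entries
        Map<String, Double> cpuAndMemory = Map.of("cpu", 2.0, "memory", 4096.0);
        Map<String, Double> accelerator = Map.of("nvidia.com/mig-3g.20gb", 1.0); // pretend model output

        Map<String, Double> merged = new HashMap<>(cpuAndMemory);
        if (accelerator != null) {   // null would mean: skip, not a GPU workload
            merged.putAll(accelerator);
        }
        System.out.println(merged);
    }
}
```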
diff --git a/src/main/java/com/autotune/analyzer/recommendations/utils/RecommendationUtils.java b/src/main/java/com/autotune/analyzer/recommendations/utils/RecommendationUtils.java
index 2deac4110..45085f33c 100644
--- a/src/main/java/com/autotune/analyzer/recommendations/utils/RecommendationUtils.java
+++ b/src/main/java/com/autotune/analyzer/recommendations/utils/RecommendationUtils.java
@@ -1,19 +1,39 @@
 package com.autotune.analyzer.recommendations.utils;

+import com.autotune.analyzer.exceptions.FetchMetricsError;
 import com.autotune.analyzer.recommendations.RecommendationConfigItem;
 import com.autotune.analyzer.recommendations.RecommendationConstants;
-import com.autotune.analyzer.recommendations.RecommendationNotification;
+import com.autotune.analyzer.recommendations.term.Terms;
 import com.autotune.analyzer.utils.AnalyzerConstants;
 import com.autotune.common.data.metrics.MetricResults;
 import com.autotune.common.data.result.ContainerData;
 import com.autotune.common.data.result.IntervalResults;
+import com.autotune.common.data.system.info.device.ContainerDeviceList;
+import com.autotune.common.data.system.info.device.accelerator.AcceleratorDeviceData;
+import com.autotune.common.datasource.DataSourceInfo;
+import com.autotune.utils.GenericRestApiClient;
 import com.autotune.utils.KruizeConstants;
+import com.google.gson.*;
+import org.json.JSONObject;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+import com.autotune.common.data.system.info.device.accelerator.metadata.AcceleratorMetaDataService;
+import com.autotune.common.data.system.info.device.accelerator.metadata.AcceleratorProfile;

+import java.io.IOException;
+import java.net.URLEncoder;
+import java.security.KeyManagementException;
+import java.security.KeyStoreException;
+import java.security.NoSuchAlgorithmException;
 import java.sql.Timestamp;
-import java.time.LocalDateTime;
+import java.text.ParseException;
+import java.text.SimpleDateFormat;
 import java.util.*;

+import static com.autotune.analyzer.utils.AnalyzerConstants.ServiceConstants.CHARACTER_ENCODING;
+
 public class RecommendationUtils {
+    private static final Logger LOGGER = LoggerFactory.getLogger(RecommendationUtils.class);
+
     public static RecommendationConfigItem getCurrentValue(Map<Timestamp, IntervalResults> filteredResultsMap,
                                                            Timestamp timestampToExtract,
                                                            AnalyzerConstants.ResourceSetting resourceSetting,
@@ -28,15 +48,15 @@ public static RecommendationConfigItem getCurrentValue(Map
+    public static void markAcceleratorDeviceStatusToContainer(ContainerData containerData,
+                                                              String maxDateQuery,
+                                                              String namespace,
+                                                              String workload,
+                                                              String workload_type,
+                                                              DataSourceInfo dataSourceInfo,
+                                                              Map<String, Terms> termsMap,
+                                                              Double measurementDurationMinutesInDouble,
+                                                              String gpuDetectionQuery)
+            throws IOException, NoSuchAlgorithmException, KeyStoreException,
+            KeyManagementException, ParseException, FetchMetricsError {
+
+        SimpleDateFormat sdf = new SimpleDateFormat(KruizeConstants.DateFormats.STANDARD_JSON_DATE_FORMAT, Locale.ROOT);
+        String containerName = containerData.getContainer_name();
+        String queryToEncode = null;
+        long interval_end_time_epoc = 0;
+        long interval_start_time_epoc = 0;
+
+        LOGGER.debug("maxDateQuery: {}", maxDateQuery);
+        queryToEncode = maxDateQuery
+                .replace(AnalyzerConstants.NAMESPACE_VARIABLE, namespace)
+                .replace(AnalyzerConstants.CONTAINER_VARIABLE, containerName)
+                .replace(AnalyzerConstants.WORKLOAD_VARIABLE, workload)
+                .replace(AnalyzerConstants.WORKLOAD_TYPE_VARIABLE, workload_type);
+
+        String dateMetricsUrl = String.format(KruizeConstants.DataSourceConstants.DATE_ENDPOINT_WITH_QUERY,
+                dataSourceInfo.getUrl(),
+                URLEncoder.encode(queryToEncode, CHARACTER_ENCODING)
+        );
+
+        LOGGER.debug(dateMetricsUrl);
+        GenericRestApiClient client = new GenericRestApiClient(dataSourceInfo);
+        client.setBaseURL(dateMetricsUrl);
+        JSONObject genericJsonObject = client.fetchMetricsJson(KruizeConstants.APIMessages.GET, "");
+        JsonObject jsonObject = new Gson().fromJson(genericJsonObject.toString(), JsonObject.class);
+        JsonArray resultArray = jsonObject.getAsJsonObject(KruizeConstants.JSONKeys.DATA).getAsJsonArray(KruizeConstants.DataSourceConstants.DataSourceQueryJSONKeys.RESULT);
+
+        if (null == resultArray || resultArray.isEmpty()) {
+            // Need to alert that the container's max duration is not detected.
+            // Ignoring it here, as it is taken care of at generate recommendations
+            return;
+        }
+
+        resultArray = resultArray.get(0)
+                .getAsJsonObject().getAsJsonArray(KruizeConstants.DataSourceConstants.DataSourceQueryJSONKeys.VALUE);
+        long epochTime = resultArray.get(0).getAsLong();
+        String timestamp = sdf.format(new Date(epochTime * KruizeConstants.TimeConv.NO_OF_MSECS_IN_SEC));
+        Date date = sdf.parse(timestamp);
+        Timestamp dateTS = new Timestamp(date.getTime());
+        interval_end_time_epoc = dateTS.getTime() / KruizeConstants.TimeConv.NO_OF_MSECS_IN_SEC
+                - ((long) dateTS.getTimezoneOffset() * KruizeConstants.TimeConv.NO_OF_SECONDS_PER_MINUTE);
+        int maxDay = Terms.getMaxDays(termsMap);
+        LOGGER.debug(KruizeConstants.APIMessages.MAX_DAY, maxDay);
+        Timestamp startDateTS = Timestamp.valueOf(Objects.requireNonNull(dateTS).toLocalDateTime().minusDays(maxDay));
+        interval_start_time_epoc = startDateTS.getTime() / KruizeConstants.TimeConv.NO_OF_MSECS_IN_SEC
+                - ((long) startDateTS.getTimezoneOffset() * KruizeConstants.TimeConv.NO_OF_SECONDS_PER_MINUTE);
+
+        gpuDetectionQuery = gpuDetectionQuery.replace(AnalyzerConstants.NAMESPACE_VARIABLE, namespace)
+                .replace(AnalyzerConstants.CONTAINER_VARIABLE, containerName)
+                .replace(AnalyzerConstants.MEASUREMENT_DURATION_IN_MIN_VARAIBLE, Integer.toString(measurementDurationMinutesInDouble.intValue()))
+                .replace(AnalyzerConstants.WORKLOAD_VARIABLE, workload)
+                .replace(AnalyzerConstants.WORKLOAD_TYPE_VARIABLE, workload_type);
+
+        String podMetricsUrl;
+        try {
+            podMetricsUrl = String.format(KruizeConstants.DataSourceConstants.DATASOURCE_ENDPOINT_WITH_QUERY,
+                    dataSourceInfo.getUrl(),
+                    URLEncoder.encode(gpuDetectionQuery, CHARACTER_ENCODING),
+                    interval_start_time_epoc,
+                    interval_end_time_epoc,
+                    measurementDurationMinutesInDouble.intValue() * KruizeConstants.TimeConv.NO_OF_SECONDS_PER_MINUTE);
+            LOGGER.debug(podMetricsUrl);
+            client.setBaseURL(podMetricsUrl);
+            genericJsonObject = client.fetchMetricsJson(KruizeConstants.APIMessages.GET, "");
+
+            jsonObject = new Gson().fromJson(genericJsonObject.toString(), JsonObject.class);
+            resultArray = jsonObject.getAsJsonObject(KruizeConstants.JSONKeys.DATA).getAsJsonArray(KruizeConstants.DataSourceConstants.DataSourceQueryJSONKeys.RESULT);
+
+            if (null != resultArray && !resultArray.isEmpty()) {
+                for (JsonElement result : resultArray) {
+                    JsonObject resultObject = result.getAsJsonObject();
+                    JsonArray valuesArray = resultObject.getAsJsonArray(KruizeConstants.DataSourceConstants
+                            .DataSourceQueryJSONKeys.VALUES);
+
+                    for (JsonElement element : valuesArray) {
+                        JsonArray valueArray = element.getAsJsonArray();
+                        double value = valueArray.get(1).getAsDouble();
+                        // TODO: Check for non-zero values to mark as a GPU workload
+                        break;
+                    }
+
+                    JsonObject metricObject = resultObject.getAsJsonObject(KruizeConstants.JSONKeys.METRIC);
+                    String modelName = metricObject.get(KruizeConstants.JSONKeys.MODEL_NAME).getAsString();
+                    if (null == modelName)
+                        continue;
+
+                    boolean isSupportedMig = checkIfModelIsKruizeSupportedMIG(modelName);
+                    if (isSupportedMig) {
+                        AcceleratorDeviceData acceleratorDeviceData = new AcceleratorDeviceData(metricObject.get(KruizeConstants.JSONKeys.MODEL_NAME).getAsString(),
+                                metricObject.get(KruizeConstants.JSONKeys.HOSTNAME).getAsString(),
+                                metricObject.get(KruizeConstants.JSONKeys.UUID).getAsString(),
+                                metricObject.get(KruizeConstants.JSONKeys.DEVICE).getAsString(),
+                                isSupportedMig);
+
+                        if (null == containerData.getContainerDeviceList()) {
+                            ContainerDeviceList containerDeviceList = new ContainerDeviceList();
+                            containerData.setContainerDeviceList(containerDeviceList);
+                        }
+                        containerData.getContainerDeviceList().addDevice(AnalyzerConstants.DeviceType.ACCELERATOR, acceleratorDeviceData);
+                        // TODO: Currently we consider only the first MIG-supported GPU
+                        return;
+                    }
+                }
+            }
+        } catch (IOException | NoSuchAlgorithmException | KeyStoreException | KeyManagementException |
+                 JsonSyntaxException e) {
+            throw new RuntimeException(e);
+        }
+    }
+
+    public static boolean checkIfModelIsKruizeSupportedMIG(String modelName) {
+        if (null == modelName || modelName.isEmpty())
+            return false;
+
+        modelName = modelName.toUpperCase();
+
+        boolean A100_CHECK = (modelName.contains("A100") &&
+                (modelName.contains("40GB") || modelName.contains("80GB")));
+
+        boolean H100_CHECK = false;
+        if (!A100_CHECK) {
+            H100_CHECK = (modelName.contains("H100") && modelName.contains("80GB"));
+        }
+
+        return A100_CHECK || H100_CHECK;
+    }
+
+    public static Timestamp getNearestTimestamp(HashMap<Timestamp, IntervalResults> containerDataResults, Timestamp targetTime, int minutesRange) {
+        long rangeInMillis = (long) minutesRange * 60 * 1000;
+        long targetTimeMillis = targetTime.getTime();
+
+        Timestamp nearestTimestamp = null;
+        long nearestDistance = Long.MAX_VALUE;
+
+        for (Map.Entry<Timestamp, IntervalResults> entry : containerDataResults.entrySet()) {
+            Timestamp currentTimestamp = entry.getKey();
+            long currentTimeMillis = currentTimestamp.getTime();
+            long distance = Math.abs(targetTimeMillis - currentTimeMillis);
+
+            if (distance <= rangeInMillis && distance < nearestDistance) {
+                nearestDistance = distance;
+                nearestTimestamp = currentTimestamp;
+            }
+        }
+
+        return nearestTimestamp;
+    }
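GPU samples rarely align exactly with the CPU/memory interval end times, which is why `getNearestTimestamp` snaps each sample to the closest existing interval within a +/- `TIMESTAMP_RANGE_CHECK_IN_MINUTES` window and drops it otherwise. A small usage sketch that reimplements the same search over a toy map:

```java
import java.sql.Timestamp;
import java.util.HashMap;
import java.util.Map;

public class NearestTimestampSketch {
    // Same search as getNearestTimestamp, over a simplified interval map
    static Timestamp nearest(Map<Timestamp, ?> intervals, Timestamp target, int minutesRange) {
        long range = (long) minutesRange * 60 * 1000, best = Long.MAX_VALUE;
        Timestamp result = null;
        for (Timestamp t : intervals.keySet()) {
            long d = Math.abs(target.getTime() - t.getTime());
            if (d <= range && d < best) { best = d; result = t; }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<Timestamp, String> intervals = new HashMap<>();
        intervals.put(Timestamp.valueOf("2024-10-10 06:00:00"), "cpu+memory interval");

        // 2.5 minutes away: snaps to the 06:00:00 interval
        System.out.println(nearest(intervals, Timestamp.valueOf("2024-10-10 06:02:30"), 5));
        // 9 minutes away: outside the window, returns null (sample is skipped)
        System.out.println(nearest(intervals, Timestamp.valueOf("2024-10-10 06:09:00"), 5));
    }
}
```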
+
+    public static HashMap<AnalyzerConstants.RecommendationItem, RecommendationConfigItem> getMapWithOptimalProfile(
+            String acceleratorModel,
+            Double coreFraction,
+            Double memoryFraction
+    ) {
+        if (null == acceleratorModel || null == coreFraction || null == memoryFraction)
+            return null;
+
+        HashMap<AnalyzerConstants.RecommendationItem, RecommendationConfigItem> returnMap = new HashMap<>();
+
+        AcceleratorMetaDataService gpuMetaDataService = AcceleratorMetaDataService.getInstance();
+        AcceleratorProfile acceleratorProfile = gpuMetaDataService.getAcceleratorProfile(acceleratorModel, coreFraction, memoryFraction);
+        RecommendationConfigItem recommendationConfigItem = new RecommendationConfigItem(1.0, "cores");
+
+        if (acceleratorProfile.getProfileName().equalsIgnoreCase(AnalyzerConstants.AcceleratorConstants.AcceleratorProfiles.PROFILE_1G_5GB)) {
+            returnMap.put(AnalyzerConstants.RecommendationItem.NVIDIA_GPU_PARTITION_1_CORE_5GB, recommendationConfigItem);
+        } else if (acceleratorProfile.getProfileName().equalsIgnoreCase(AnalyzerConstants.AcceleratorConstants.AcceleratorProfiles.PROFILE_1G_10GB)) {
+            returnMap.put(AnalyzerConstants.RecommendationItem.NVIDIA_GPU_PARTITION_1_CORE_10GB, recommendationConfigItem);
+        } else if (acceleratorProfile.getProfileName().equalsIgnoreCase(AnalyzerConstants.AcceleratorConstants.AcceleratorProfiles.PROFILE_1G_20GB)) {
+            returnMap.put(AnalyzerConstants.RecommendationItem.NVIDIA_GPU_PARTITION_1_CORE_20GB, recommendationConfigItem);
+        } else if (acceleratorProfile.getProfileName().equalsIgnoreCase(AnalyzerConstants.AcceleratorConstants.AcceleratorProfiles.PROFILE_2G_10GB)) {
+            returnMap.put(AnalyzerConstants.RecommendationItem.NVIDIA_GPU_PARTITION_2_CORES_10GB, recommendationConfigItem);
+        } else if (acceleratorProfile.getProfileName().equalsIgnoreCase(AnalyzerConstants.AcceleratorConstants.AcceleratorProfiles.PROFILE_2G_20GB)) {
+            returnMap.put(AnalyzerConstants.RecommendationItem.NVIDIA_GPU_PARTITION_2_CORES_20GB, recommendationConfigItem);
+        } else if (acceleratorProfile.getProfileName().equalsIgnoreCase(AnalyzerConstants.AcceleratorConstants.AcceleratorProfiles.PROFILE_3G_20GB)) {
+            returnMap.put(AnalyzerConstants.RecommendationItem.NVIDIA_GPU_PARTITION_3_CORES_20GB, recommendationConfigItem);
+        } else if (acceleratorProfile.getProfileName().equalsIgnoreCase(AnalyzerConstants.AcceleratorConstants.AcceleratorProfiles.PROFILE_3G_40GB)) {
+            returnMap.put(AnalyzerConstants.RecommendationItem.NVIDIA_GPU_PARTITION_3_CORES_40GB, recommendationConfigItem);
+        } else if (acceleratorProfile.getProfileName().equalsIgnoreCase(AnalyzerConstants.AcceleratorConstants.AcceleratorProfiles.PROFILE_4G_20GB)) {
+            returnMap.put(AnalyzerConstants.RecommendationItem.NVIDIA_GPU_PARTITION_4_CORES_20GB, recommendationConfigItem);
+        } else if (acceleratorProfile.getProfileName().equalsIgnoreCase(AnalyzerConstants.AcceleratorConstants.AcceleratorProfiles.PROFILE_4G_40GB)) {
+            returnMap.put(AnalyzerConstants.RecommendationItem.NVIDIA_GPU_PARTITION_4_CORES_40GB, recommendationConfigItem);
+        } else if (acceleratorProfile.getProfileName().equalsIgnoreCase(AnalyzerConstants.AcceleratorConstants.AcceleratorProfiles.PROFILE_7G_40GB)) {
+            returnMap.put(AnalyzerConstants.RecommendationItem.NVIDIA_GPU_PARTITION_7_CORES_40GB, recommendationConfigItem);
+        } else if (acceleratorProfile.getProfileName().equalsIgnoreCase(AnalyzerConstants.AcceleratorConstants.AcceleratorProfiles.PROFILE_7G_80GB)) {
+            returnMap.put(AnalyzerConstants.RecommendationItem.NVIDIA_GPU_PARTITION_7_CORES_80GB, recommendationConfigItem);
+        }
+        return returnMap;
+    }
+
+    public static String getSupportedModelBasedOnModelName(String modelName) {
+        if (null == modelName || modelName.isEmpty())
+            return null;
+
+        modelName = modelName.toUpperCase();
+
(modelName.contains("A100") && modelName.contains("40GB")) + return AnalyzerConstants.AcceleratorConstants.SupportedAccelerators.A100_40_GB; + + if (modelName.contains("A100") && modelName.contains("80GB")) + return AnalyzerConstants.AcceleratorConstants.SupportedAccelerators.A100_80_GB; + + if (modelName.contains("H100") && modelName.contains("80GB")) + return AnalyzerConstants.AcceleratorConstants.SupportedAccelerators.H100_80_GB; + + return null; + } } diff --git a/src/main/java/com/autotune/analyzer/serviceObjects/BulkInput.java b/src/main/java/com/autotune/analyzer/serviceObjects/BulkInput.java new file mode 100644 index 000000000..e5e31d40d --- /dev/null +++ b/src/main/java/com/autotune/analyzer/serviceObjects/BulkInput.java @@ -0,0 +1,139 @@ +/******************************************************************************* + * Copyright (c) 2022 Red Hat, IBM Corporation and others. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + *******************************************************************************/ +package com.autotune.analyzer.serviceObjects; + +import java.util.List; +import java.util.Map; + +/** + * Request payload object for Bulk Api service + */ +public class BulkInput { + private FilterWrapper filter; + private TimeRange time_range; + private String datasource; + + // Getters and Setters + + public TimeRange getTime_range() { + return time_range; + } + + public void setTime_range(TimeRange time_range) { + this.time_range = time_range; + } + + public String getDatasource() { + return datasource; + } + + public void setDatasource(String datasource) { + this.datasource = datasource; + } + + public FilterWrapper getFilter() { + return filter; + } + + public void setFilter(FilterWrapper filter) { + this.filter = filter; + } + + // Nested class for FilterWrapper that contains 'exclude' and 'include' + public static class FilterWrapper { + private Filter exclude; + private Filter include; + + // Getters and Setters + public Filter getExclude() { + return exclude; + } + + public void setExclude(Filter exclude) { + this.exclude = exclude; + } + + public Filter getInclude() { + return include; + } + + public void setInclude(Filter include) { + this.include = include; + } + } + + public static class Filter { + private List namespace; + private List workload; + private List containers; + private Map labels; + + // Getters and Setters + public List getNamespace() { + return namespace; + } + + public void setNamespace(List namespace) { + this.namespace = namespace; + } + + public List getWorkload() { + return workload; + } + + public void setWorkload(List workload) { + this.workload = workload; + } + + public List getContainers() { + return containers; + } + + public void setContainers(List containers) { + this.containers = containers; + } + + public Map getLabels() { + return labels; + } + + public void setLabels(Map labels) { + this.labels = labels; + } + } + + public static class TimeRange { + private String start; + private String end; + + // Getters and Setters + 
+        public String getStart() {
+            return start;
+        }
+
+        public void setStart(String start) {
+            this.start = start;
+        }
+
+        public String getEnd() {
+            return end;
+        }
+
+        public void setEnd(String end) {
+            this.end = end;
+        }
+    }
+}
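`BulkInput` is the Jackson target that `BulkService.doPost` binds the request body onto. A quick sanity check (not part of the PR) that a payload in the documented shape binds cleanly, assuming jackson-databind and this PR's classes on the classpath:

```java
import com.autotune.analyzer.serviceObjects.BulkInput;
import com.fasterxml.jackson.databind.ObjectMapper;

public class BulkInputBindingSketch {
    public static void main(String[] args) throws Exception {
        String payload = """
                {
                  "filter": {
                    "include": { "namespace": ["monitoring"], "labels": {"key1": "value1"} }
                  },
                  "time_range": {},
                  "datasource": "Cbank1Xyz"
                }""";
        // Same deserialization path as BulkService.doPost
        BulkInput input = new ObjectMapper().readValue(payload, BulkInput.class);
        System.out.println(input.getDatasource());                        // Cbank1Xyz
        System.out.println(input.getFilter().getInclude().getNamespace()); // [monitoring]
    }
}
```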
diff --git a/src/main/java/com/autotune/analyzer/serviceObjects/BulkJobStatus.java b/src/main/java/com/autotune/analyzer/serviceObjects/BulkJobStatus.java
new file mode 100644
index 000000000..d45f37774
--- /dev/null
+++ b/src/main/java/com/autotune/analyzer/serviceObjects/BulkJobStatus.java
@@ -0,0 +1,293 @@
+/*******************************************************************************
+ * Copyright (c) 2022 Red Hat, IBM Corporation and others.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ *******************************************************************************/
+package com.autotune.analyzer.serviceObjects;
+
+import com.fasterxml.jackson.annotation.JsonFilter;
+import com.fasterxml.jackson.annotation.JsonProperty;
+
+import java.time.Instant;
+import java.time.ZoneOffset;
+import java.time.format.DateTimeFormatter;
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.List;
+
+import static com.autotune.utils.KruizeConstants.KRUIZE_BULK_API.JOB_ID;
+
+/**
+ * Bulk API Response payload Object.
+ */
+@JsonFilter("jobFilter")
+public class BulkJobStatus {
+    @JsonProperty(JOB_ID)
+    private String jobID;
+    private String status;
+    private int total_experiments;
+    private int processed_experiments;
+    private Data data;
+    @JsonProperty("job_start_time")
+    private String startTime; // String, to store the formatted time
+    @JsonProperty("job_end_time")
+    private String endTime; // String, to store the formatted time
+    private String message;
+
+    public BulkJobStatus(String jobID, String status, Data data, Instant startTime) {
+        this.jobID = jobID;
+        this.status = status;
+        this.data = data;
+        setStartTime(startTime);
+    }
+
+    public String getJobID() {
+        return jobID;
+    }
+
+    public String getStatus() {
+        return status;
+    }
+
+    public void setStatus(String status) {
+        this.status = status;
+    }
+
+    public String getStartTime() {
+        return startTime;
+    }
+
+    public void setStartTime(Instant startTime) {
+        this.startTime = formatInstantAsUTCString(startTime);
+    }
+
+    public void setStartTime(String startTime) {
+        this.startTime = startTime;
+    }
+
+    public String getEndTime() {
+        return endTime;
+    }
+
+    public void setEndTime(Instant endTime) {
+        this.endTime = formatInstantAsUTCString(endTime);
+    }
+
+    public void setEndTime(String endTime) {
+        this.endTime = endTime;
+    }
+
+    public int getTotal_experiments() {
+        return total_experiments;
+    }
+
+    public void setTotal_experiments(int total_experiments) {
+        this.total_experiments = total_experiments;
+    }
+
+    public int getProcessed_experiments() {
+        return processed_experiments;
+    }
+
+    public void setProcessed_experiments(int processed_experiments) {
+        this.processed_experiments = processed_experiments;
+    }
+
+    public Data getData() {
+        return data;
+    }
+
+    public void setData(Data data) {
+        this.data = data;
+    }
+
+    // Utility function to format an Instant into the required UTC format
+    private String formatInstantAsUTCString(Instant instant) {
+        DateTimeFormatter formatter = DateTimeFormatter
+                .ofPattern("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'")
+                .withZone(ZoneOffset.UTC); // Ensure it's in UTC
+
+        return formatter.format(instant);
+    }
+
+    public String getMessage() {
+        return message;
+    }
+
+    public void setMessage(String message) {
+        this.message = message;
+    }
+
+    // Inner class for the data field
+    public static class Data {
+        private Experiments experiments;
+        private Recommendations recommendations;
+
+        public Data(Experiments experiments, Recommendations recommendations) {
+            this.experiments = experiments;
+            this.recommendations = recommendations;
+        }
+
+        public Experiments getExperiments() {
+            return experiments;
+        }
+
+        public void setExperiments(Experiments experiments) {
+            this.experiments = experiments;
+        }
+
+        public Recommendations getRecommendations() {
+            return recommendations;
+        }
+
+        public void setRecommendations(Recommendations recommendations) {
+            this.recommendations = recommendations;
+        }
+    }
+
+    // Inner class for experiments
+    public static class Experiments {
+        @JsonProperty("new")
+        private List<String> newExperiments;
+        @JsonProperty("updated")
+        private List<String> updatedExperiments;
+        @JsonProperty("failed")
+        private List<String> failedExperiments;
+
+        public Experiments(List<String> newExperiments, List<String> updatedExperiments) {
+            this.newExperiments = newExperiments;
+            this.updatedExperiments = updatedExperiments;
+        }
+
+        public List<String> getNewExperiments() {
+            return newExperiments;
+        }
+
+        public void setNewExperiments(List<String> newExperiments) {
+            this.newExperiments = newExperiments;
+        }
+
+        public List<String> getUpdatedExperiments() {
+            return updatedExperiments;
+        }
+
+        public void setUpdatedExperiments(List<String> updatedExperiments) {
+            this.updatedExperiments = updatedExperiments;
+        }
+    }
+
+    // Inner class for recommendations
+    public static class Recommendations {
+        private RecommendationData data;
+
+        public Recommendations(RecommendationData data) {
+            this.data = data;
+        }
+
+        public RecommendationData getData() {
+            return data;
+        }
+
+        public void setData(RecommendationData data) {
+            this.data = data;
+        }
+    }
+
+    // Inner class for recommendation data
+    public static class RecommendationData {
+        private List<String> processed = Collections.synchronizedList(new ArrayList<>());
+        private List<String> processing = Collections.synchronizedList(new ArrayList<>());
+        private List<String> unprocessed = Collections.synchronizedList(new ArrayList<>());
+        private List<String> failed = Collections.synchronizedList(new ArrayList<>());
+
+        public RecommendationData(List<String> processed, List<String> processing, List<String> unprocessed, List<String> failed) {
+            this.processed = processed;
+            this.processing = processing;
+            this.unprocessed = unprocessed;
+            this.failed = failed;
+        }
+
+        public List<String> getProcessed() {
+            return processed;
+        }
+
+        public synchronized void setProcessed(List<String> processed) {
+            this.processed = processed;
+        }
+
+        public List<String> getProcessing() {
+            return processing;
+        }
+
+        public synchronized void setProcessing(List<String> processing) {
+            this.processing = processing;
+        }
+
+        public List<String> getUnprocessed() {
+            return unprocessed;
+        }
+
+        public synchronized void setUnprocessed(List<String> unprocessed) {
+            this.unprocessed = unprocessed;
+        }
+
+        public List<String> getFailed() {
+            return failed;
+        }
+
+        public synchronized void setFailed(List<String> failed) {
+            this.failed = failed;
+        }
+
+        // Move an element from unprocessed to processing
+        public synchronized void moveToProgress(String element) {
+            if (unprocessed.contains(element)) {
+                unprocessed.remove(element);
+                if (!processing.contains(element)) {
+                    processing.add(element);
+                }
+            }
+        }
+
+        // Move an element from processing to processed
+        public synchronized void moveToCompleted(String element) {
+            if (processing.contains(element)) {
+                processing.remove(element);
+                if (!processed.contains(element)) {
+                    processed.add(element);
+                }
+            }
+        }
+
+        // Move an element from processing to failed
+        public synchronized void moveToFailed(String element) {
+            if (processing.contains(element)) {
+                processing.remove(element);
+                if (!failed.contains(element)) {
+                    failed.add(element);
+                }
+            }
+        }
+
+        // Calculate the percentage of completion
+        public int completionPercentage() {
+            int totalTasks = processed.size() + processing.size() + unprocessed.size() + failed.size();
+            if (totalTasks == 0) {
+                return 0;
+            }
+            return (int) ((processed.size() * 100.0) / totalTasks);
+        }
+    }
+}
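`RecommendationData` is effectively a set of work queues: experiments drain from unprocessed to processing to processed (or failed), and the completion percentage is processed over the total across all four lists. A minimal walk-through of that lifecycle (not in the PR; names mirror the class above):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class RecommendationQueueSketch {
    static final List<String> unprocessed = Collections.synchronizedList(new ArrayList<>(List.of("exp-a", "exp-b")));
    static final List<String> processing = Collections.synchronizedList(new ArrayList<>());
    static final List<String> processed = Collections.synchronizedList(new ArrayList<>());
    static final List<String> failed = Collections.synchronizedList(new ArrayList<>());

    // Same transitions as moveToProgress / moveToCompleted above
    static synchronized void moveToProgress(String e) {
        if (unprocessed.remove(e) && !processing.contains(e)) processing.add(e);
    }

    static synchronized void moveToCompleted(String e) {
        if (processing.remove(e) && !processed.contains(e)) processed.add(e);
    }

    static int completionPercentage() {
        int total = processed.size() + processing.size() + unprocessed.size() + failed.size();
        return total == 0 ? 0 : (int) ((processed.size() * 100.0) / total);
    }

    public static void main(String[] args) {
        moveToProgress("exp-a");
        moveToCompleted("exp-a");
        // exp-a done, exp-b still unprocessed -> 50%
        System.out.println(completionPercentage() + "%");
    }
}
```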
diff --git a/src/main/java/com/autotune/analyzer/serviceObjects/KubernetesAPIObject.java b/src/main/java/com/autotune/analyzer/serviceObjects/KubernetesAPIObject.java
index 0a6d52ecf..d24cc3638 100644
--- a/src/main/java/com/autotune/analyzer/serviceObjects/KubernetesAPIObject.java
+++ b/src/main/java/com/autotune/analyzer/serviceObjects/KubernetesAPIObject.java
@@ -49,14 +49,26 @@ public String getType() {
         return type;
     }

+    public void setType(String type) {
+        this.type = type;
+    }
+
     public String getName() {
         return name;
     }

+    public void setName(String name) {
+        this.name = name;
+    }
+
     public String getNamespace() {
         return namespace;
     }

+    public void setNamespace(String namespace) {
+        this.namespace = namespace;
+    }
+
     @JsonProperty(KruizeConstants.JSONKeys.CONTAINERS)
     public List getContainerAPIObjects() {
         return containerAPIObjects;
diff --git a/src/main/java/com/autotune/analyzer/services/BulkService.java b/src/main/java/com/autotune/analyzer/services/BulkService.java
new file mode 100644
index 000000000..1f7e3debf
--- /dev/null
+++ b/src/main/java/com/autotune/analyzer/services/BulkService.java
@@ -0,0 +1,159 @@
+/*******************************************************************************
+ * Copyright (c) 2022 Red Hat, IBM Corporation and others.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ *******************************************************************************/
+package com.autotune.analyzer.services;
+
+import com.autotune.analyzer.serviceObjects.BulkInput;
+import com.autotune.analyzer.serviceObjects.BulkJobStatus;
+import com.autotune.analyzer.workerimpl.BulkJobManager;
+import com.fasterxml.jackson.databind.ObjectMapper;
+import com.fasterxml.jackson.databind.ser.impl.SimpleBeanPropertyFilter;
+import com.fasterxml.jackson.databind.ser.impl.SimpleFilterProvider;
+import org.json.JSONObject;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import javax.servlet.ServletConfig;
+import javax.servlet.ServletException;
+import javax.servlet.annotation.WebServlet;
+import javax.servlet.http.HttpServlet;
+import javax.servlet.http.HttpServletRequest;
+import javax.servlet.http.HttpServletResponse;
+import java.io.IOException;
+import java.time.Instant;
+import java.util.ArrayList;
+import java.util.Map;
+import java.util.UUID;
+import java.util.concurrent.ConcurrentHashMap;
+import java.util.concurrent.ExecutorService;
+import java.util.concurrent.Executors;
+
+import static com.autotune.analyzer.utils.AnalyzerConstants.ServiceConstants.*;
+import static com.autotune.utils.KruizeConstants.KRUIZE_BULK_API.*;
+
+/**
+ * Servlet handling the Bulk API's POST and GET endpoints.
+ */
+@WebServlet(asyncSupported = true)
+public class BulkService extends HttpServlet {
+    private static final long serialVersionUID = 1L;
+    private static final Logger LOGGER = LoggerFactory.getLogger(BulkService.class);
+    private ExecutorService executorService = Executors.newFixedThreadPool(10);
+    private Map<String, BulkJobStatus> jobStatusMap = new ConcurrentHashMap<>();
+
+    @Override
+    public void init(ServletConfig config) throws ServletException {
+        super.init(config);
+    }
+
+    /**
+     * @param req
+     * @param resp
+     * @throws ServletException
+     * @throws IOException
+     */
+    @Override
+    protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws ServletException, IOException {
+        String jobID = req.getParameter(JOB_ID);
+        String verboseParam = req.getParameter(VERBOSE);
+        // If the parameter is not provided (null), default it to false
+        boolean verbose = verboseParam != null && Boolean.parseBoolean(verboseParam);
+        BulkJobStatus jobDetails = jobStatusMap.get(jobID);
+        resp.setContentType(JSON_CONTENT_TYPE);
+        resp.setCharacterEncoding(CHARACTER_ENCODING);
+        SimpleFilterProvider filters = new SimpleFilterProvider();
+
+        if (jobDetails == null) {
+            sendErrorResponse(
+                    resp,
+                    null,
+                    HttpServletResponse.SC_NOT_FOUND,
+                    JOB_NOT_FOUND_MSG
+            );
+        } else {
+            try {
+                resp.setStatus(HttpServletResponse.SC_OK);
+                // Return the JSON representation of the JobStatus object
+                ObjectMapper objectMapper = new ObjectMapper();
+                if (!verbose) {
+                    filters.addFilter("jobFilter", SimpleBeanPropertyFilter.serializeAllExcept("data"));
+                } else {
+                    filters.addFilter("jobFilter", SimpleBeanPropertyFilter.serializeAll());
+                }
+                objectMapper.setFilterProvider(filters);
+                String jsonResponse = objectMapper.writeValueAsString(jobDetails);
+                resp.getWriter().write(jsonResponse);
+            } catch (Exception e) {
+                e.printStackTrace();
+            }
+        }
+    }
+
+    /**
+     * @param request
+     * @param response
+     * @throws ServletException
+     * @throws IOException
+     */
+    @Override
+    protected void doPost(HttpServletRequest request, HttpServletResponse response) throws ServletException, IOException {
+        // Set response type
+        response.setContentType(JSON_CONTENT_TYPE);
+        response.setCharacterEncoding(CHARACTER_ENCODING);
+
+        // Create ObjectMapper instance
+        ObjectMapper objectMapper = new ObjectMapper();
+
+        // Read the request payload and map it to the BulkInput class
+        BulkInput payload = objectMapper.readValue(request.getInputStream(), BulkInput.class);
+
+        // Generate a unique jobID
+        String jobID = UUID.randomUUID().toString();
+        BulkJobStatus.Data data = new BulkJobStatus.Data(
+                new BulkJobStatus.Experiments(new ArrayList<>(), new ArrayList<>()),
+                new BulkJobStatus.Recommendations(new BulkJobStatus.RecommendationData(
+                        new ArrayList<>(),
+                        new ArrayList<>(),
+                        new ArrayList<>(),
+                        new ArrayList<>()
+                ))
+        );
+        jobStatusMap.put(jobID, new BulkJobStatus(jobID, IN_PROGRESS, data, Instant.now()));
+        // Submit the job to be processed asynchronously
+        executorService.submit(new BulkJobManager(jobID, jobStatusMap, payload));
+
+        // Send a simple success response back: return the jobID to the user
+        JSONObject jsonObject = new JSONObject();
+        jsonObject.put(JOB_ID, jobID);
+        response.getWriter().write(jsonObject.toString());
+    }
+
+    @Override
+    public void destroy() {
+        executorService.shutdown();
+    }
+
+    public void sendErrorResponse(HttpServletResponse response, Exception e, int httpStatusCode, String errorMsg) throws
+            IOException {
+        if (null != e) {
+            LOGGER.error(e.toString());
+            e.printStackTrace();
+            if (null == errorMsg) errorMsg = e.getMessage();
+        }
+        response.sendError(httpStatusCode, errorMsg);
+    }
+}
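From a client's point of view, the servlet above is used in two steps: POST a payload to get a `job_id`, then GET with that `job_id` to poll status. A sketch using the JDK HTTP client; the host/port are placeholders, and the job_id extraction uses a regex only to keep the sketch short (a real client would parse the JSON):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class BulkClientSketch {
    public static void main(String[] args) throws Exception {
        HttpClient http = HttpClient.newHttpClient();
        String base = "http://localhost:8080/bulk"; // placeholder Kruize endpoint

        // Step 1: submit the bulk job
        HttpResponse<String> post = http.send(
                HttpRequest.newBuilder(URI.create(base))
                        .header("Content-Type", "application/json")
                        .POST(HttpRequest.BodyPublishers.ofString("{\"datasource\": \"Cbank1Xyz\"}"))
                        .build(),
                HttpResponse.BodyHandlers.ofString());
        String jobId = post.body().replaceAll(".*\"job_id\"\\s*:\\s*\"([^\"]+)\".*", "$1");

        // Step 2: poll the job status
        HttpResponse<String> get = http.send(
                HttpRequest.newBuilder(URI.create(base + "?job_id=" + jobId)).GET().build(),
                HttpResponse.BodyHandlers.ofString());
        System.out.println(get.body()); // status, total_experiments, processed_experiments, ...
    }
}
```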
diff --git a/src/main/java/com/autotune/analyzer/services/DSMetadataService.java b/src/main/java/com/autotune/analyzer/services/DSMetadataService.java
index 904ada6ad..4f786b419 100644
--- a/src/main/java/com/autotune/analyzer/services/DSMetadataService.java
+++ b/src/main/java/com/autotune/analyzer/services/DSMetadataService.java
@@ -16,6 +16,8 @@
 package com.autotune.analyzer.services;

+import com.autotune.analyzer.adapters.DeviceDetailsAdapter;
+import com.autotune.analyzer.adapters.RecommendationItemAdapter;
 import com.autotune.analyzer.exceptions.KruizeResponse;
 import com.autotune.analyzer.serviceObjects.DSMetadataAPIObject;
 import com.autotune.analyzer.utils.AnalyzerConstants;
@@ -23,6 +25,7 @@
 import com.autotune.analyzer.utils.GsonUTCDateAdapter;
 import com.autotune.common.data.ValidationOutputData;
 import com.autotune.common.data.dataSourceMetadata.DataSourceMetadataInfo;
+import com.autotune.common.data.system.info.device.DeviceDetails;
 import com.autotune.common.datasource.DataSourceInfo;
 import com.autotune.common.datasource.DataSourceManager;
 import com.autotune.common.datasource.DataSourceMetadataValidation;
@@ -130,7 +133,7 @@ protected void doPost(HttpServletRequest request, HttpServletResponse response)
                 return;
             }

-            DataSourceMetadataInfo metadataInfo = dataSourceManager.importMetadataFromDataSource(datasource);
+            DataSourceMetadataInfo metadataInfo = dataSourceManager.importMetadataFromDataSource(datasource, "", 0, 0, 0);

             // Validate imported metadataInfo object
             DataSourceMetadataValidation validationObject = new DataSourceMetadataValidation();
@@ -240,6 +243,7 @@ private void sendSuccessResponse(HttpServletResponse response, DataSourceMetadat
                     .setPrettyPrinting()
                     .enableComplexMapKeySerialization()
                     .registerTypeAdapter(Date.class, new GsonUTCDateAdapter())
+                    .registerTypeAdapter(AnalyzerConstants.RecommendationItem.class, new RecommendationItemAdapter())
                     .create();
             gsonStr = gsonObj.toJson(dataSourceMetadata);
         }
@@ -416,6 +420,8 @@ private Gson createGsonObject() {
                 .setPrettyPrinting()
                 .enableComplexMapKeySerialization()
                 .registerTypeAdapter(Date.class, new GsonUTCDateAdapter())
+                .registerTypeAdapter(AnalyzerConstants.RecommendationItem.class, new RecommendationItemAdapter())
+                .registerTypeAdapter(DeviceDetails.class, new DeviceDetailsAdapter())
                 .create();
     }
     private boolean isValidBooleanValue(String value) {
diff --git a/src/main/java/com/autotune/analyzer/services/GenerateRecommendations.java b/src/main/java/com/autotune/analyzer/services/GenerateRecommendations.java
index 64d05fe9c..8a2d5f22c 100644
--- a/src/main/java/com/autotune/analyzer/services/GenerateRecommendations.java
+++ b/src/main/java/com/autotune/analyzer/services/GenerateRecommendations.java
@@ -15,6 +15,8 @@
  *******************************************************************************/
 package com.autotune.analyzer.services;

+import com.autotune.analyzer.adapters.DeviceDetailsAdapter;
+import com.autotune.analyzer.adapters.RecommendationItemAdapter;
 import com.autotune.analyzer.exceptions.FetchMetricsError;
 import com.autotune.analyzer.kruizeObject.KruizeObject;
 import com.autotune.analyzer.recommendations.engine.RecommendationEngine;
@@ -29,6 +31,7 @@
 import com.autotune.common.data.metrics.MetricResults;
 import com.autotune.common.data.result.ContainerData;
 import com.autotune.common.data.result.IntervalResults;
+import com.autotune.common.data.system.info.device.DeviceDetails;
 import com.autotune.common.datasource.DataSourceInfo;
 import com.autotune.common.k8sObjects.K8sObject;
 import com.autotune.utils.GenericRestApiClient;
@@ -171,6 +174,8 @@ public boolean shouldSkipClass(Class clazz) {
                     .setPrettyPrinting()
                     .enableComplexMapKeySerialization()
                     .registerTypeAdapter(Date.class, new GsonUTCDateAdapter())
+                    .registerTypeAdapter(AnalyzerConstants.RecommendationItem.class, new RecommendationItemAdapter())
+                    .registerTypeAdapter(DeviceDetails.class, new DeviceDetailsAdapter())
                     .setExclusionStrategies(strategy)
                     .create();
             gsonStr = gsonObj.toJson(recommendationList);
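The same two Gson adapter registrations recur across every service below. The source of `RecommendationItemAdapter` is not shown in this section, so the following is only a guess at its shape: an enum like `RecommendationItem`, whose `toString()` carries the Kubernetes resource name (e.g. "nvidia.com/mig-1g.5gb"), needs a custom adapter so values serialize to those wire strings rather than the enum constant names. Sketch over a local stand-in enum:

```java
import com.google.gson.TypeAdapter;
import com.google.gson.stream.JsonReader;
import com.google.gson.stream.JsonWriter;

import java.io.IOException;

public class EnumValueAdapterSketch extends TypeAdapter<EnumValueAdapterSketch.Item> {
    // Stand-in for AnalyzerConstants.RecommendationItem
    public enum Item {
        CPU("cpu"), NVIDIA_GPU_PARTITION_1_CORE_5GB("nvidia.com/mig-1g.5gb");
        private final String value;
        Item(String value) { this.value = value; }
        @Override public String toString() { return value; }
    }

    @Override
    public void write(JsonWriter out, Item item) throws IOException {
        // Emit "nvidia.com/mig-1g.5gb", not the constant name
        out.value(item.toString());
    }

    @Override
    public Item read(JsonReader in) throws IOException {
        String raw = in.nextString();
        for (Item item : Item.values()) {
            if (item.toString().equals(raw)) return item;
        }
        throw new IOException("Unknown recommendation item: " + raw);
    }
}
```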
diff --git a/src/main/java/com/autotune/analyzer/services/ListDatasources.java b/src/main/java/com/autotune/analyzer/services/ListDatasources.java
index 1af77454d..9493f3ad5 100644
--- a/src/main/java/com/autotune/analyzer/services/ListDatasources.java
+++ b/src/main/java/com/autotune/analyzer/services/ListDatasources.java
@@ -16,10 +16,13 @@
 package com.autotune.analyzer.services;

+import com.autotune.analyzer.adapters.DeviceDetailsAdapter;
+import com.autotune.analyzer.adapters.RecommendationItemAdapter;
 import com.autotune.analyzer.serviceObjects.ListDatasourcesAPIObject;
 import com.autotune.analyzer.utils.AnalyzerConstants;
 import com.autotune.analyzer.utils.AnalyzerErrorConstants;
 import com.autotune.analyzer.utils.GsonUTCDateAdapter;
+import com.autotune.common.data.system.info.device.DeviceDetails;
 import com.autotune.common.datasource.DataSourceInfo;
 import com.autotune.database.service.ExperimentDBService;
 import com.autotune.utils.MetricsConfig;
@@ -148,6 +151,8 @@ private Gson createGsonObject() {
                 .setPrettyPrinting()
                 .enableComplexMapKeySerialization()
                 .registerTypeAdapter(Date.class, new GsonUTCDateAdapter())
+                .registerTypeAdapter(AnalyzerConstants.RecommendationItem.class, new RecommendationItemAdapter())
+                .registerTypeAdapter(DeviceDetails.class, new DeviceDetailsAdapter())
                 .create();
     }
diff --git a/src/main/java/com/autotune/analyzer/services/ListExperiments.java b/src/main/java/com/autotune/analyzer/services/ListExperiments.java
index e71d5e96f..b8ca71447 100644
--- a/src/main/java/com/autotune/analyzer/services/ListExperiments.java
+++ b/src/main/java/com/autotune/analyzer/services/ListExperiments.java
@@ -16,6 +16,8 @@
 package com.autotune.analyzer.services;

+import com.autotune.analyzer.adapters.DeviceDetailsAdapter;
+import com.autotune.analyzer.adapters.RecommendationItemAdapter;
 import com.autotune.analyzer.experiment.KruizeExperiment;
 import com.autotune.analyzer.kruizeObject.KruizeObject;
 import com.autotune.analyzer.serviceObjects.ContainerAPIObject;
@@ -29,6 +31,7 @@
 import com.autotune.common.data.metrics.MetricResults;
 import com.autotune.common.data.result.ContainerData;
 import com.autotune.common.data.result.IntervalResults;
+import com.autotune.common.data.system.info.device.DeviceDetails;
 import com.autotune.common.k8sObjects.K8sObject;
 import com.autotune.common.target.kubernetes.service.KubernetesServices;
 import com.autotune.common.trials.ExperimentTrial;
@@ -281,6 +284,8 @@ private Gson createGsonObject() {
                 .setPrettyPrinting()
                 .enableComplexMapKeySerialization()
                 .registerTypeAdapter(Date.class, new GsonUTCDateAdapter())
+                .registerTypeAdapter(AnalyzerConstants.RecommendationItem.class, new RecommendationItemAdapter())
+                .registerTypeAdapter(DeviceDetails.class, new DeviceDetailsAdapter())
                 .setExclusionStrategies(new ExclusionStrategy() {
                     @Override
                     public boolean shouldSkipField(FieldAttributes f) {
diff --git a/src/main/java/com/autotune/analyzer/services/ListRecommendations.java b/src/main/java/com/autotune/analyzer/services/ListRecommendations.java
index 69bcca37c..ee533905f 100644
--- a/src/main/java/com/autotune/analyzer/services/ListRecommendations.java
+++ b/src/main/java/com/autotune/analyzer/services/ListRecommendations.java
@@ -16,6 +16,8 @@
 package com.autotune.analyzer.services;

+import com.autotune.analyzer.adapters.DeviceDetailsAdapter;
+import com.autotune.analyzer.adapters.RecommendationItemAdapter;
 import com.autotune.analyzer.exceptions.KruizeResponse;
 import com.autotune.analyzer.kruizeObject.KruizeObject;
 import com.autotune.analyzer.serviceObjects.ContainerAPIObject;
@@ -26,6 +28,7 @@
 import com.autotune.analyzer.utils.GsonUTCDateAdapter;
 import com.autotune.analyzer.utils.ServiceHelpers;
 import com.autotune.common.data.result.ContainerData;
+import com.autotune.common.data.system.info.device.DeviceDetails;
 import com.autotune.database.service.ExperimentDBService;
 import com.autotune.utils.KruizeConstants;
 import com.autotune.utils.MetricsConfig;
@@ -224,6 +227,8 @@ public boolean shouldSkipClass(Class clazz) {
                     .setPrettyPrinting()
                     .enableComplexMapKeySerialization()
                     .registerTypeAdapter(Date.class, new GsonUTCDateAdapter())
+                    .registerTypeAdapter(AnalyzerConstants.RecommendationItem.class, new RecommendationItemAdapter())
+                    .registerTypeAdapter(DeviceDetails.class, new DeviceDetailsAdapter())
                     .setExclusionStrategies(strategy)
                     .create();
             gsonStr = gsonObj.toJson(recommendationList);
diff --git a/src/main/java/com/autotune/analyzer/services/ListSupportedK8sObjects.java b/src/main/java/com/autotune/analyzer/services/ListSupportedK8sObjects.java
index 1ac7dc39d..f0b2db569 100644
--- a/src/main/java/com/autotune/analyzer/services/ListSupportedK8sObjects.java
+++ b/src/main/java/com/autotune/analyzer/services/ListSupportedK8sObjects.java
@@ -15,9 +15,12 @@
  *******************************************************************************/
 package com.autotune.analyzer.services;

+import com.autotune.analyzer.adapters.DeviceDetailsAdapter;
+import com.autotune.analyzer.adapters.RecommendationItemAdapter;
 import com.autotune.analyzer.serviceObjects.ListSupportedK8sObjectsSO;
 import com.autotune.analyzer.utils.GsonUTCDateAdapter;
 import com.autotune.analyzer.utils.AnalyzerConstants;
+import com.autotune.common.data.system.info.device.DeviceDetails;
 import com.autotune.utils.Utils;
 import com.google.gson.Gson;
 import com.google.gson.GsonBuilder;
@@ -57,6 +60,8 @@ protected void doPost(HttpServletRequest request, HttpServletResponse response)
                 .setPrettyPrinting()
                 .enableComplexMapKeySerialization()
                 .registerTypeAdapter(Date.class, new GsonUTCDateAdapter())
+                .registerTypeAdapter(AnalyzerConstants.RecommendationItem.class, new RecommendationItemAdapter())
+                .registerTypeAdapter(DeviceDetails.class, new DeviceDetailsAdapter())
                 .create();
         // Convert the Service object to JSON
         responseGSONString = gsonObj.toJson(listSupportedK8sObjectsSO);
diff --git a/src/main/java/com/autotune/analyzer/services/MetricProfileService.java b/src/main/java/com/autotune/analyzer/services/MetricProfileService.java
index d4311d07d..ca5372c0e 100644
--- a/src/main/java/com/autotune/analyzer/services/MetricProfileService.java
+++ b/src/main/java/com/autotune/analyzer/services/MetricProfileService.java
@@ -16,6 +16,8 @@
 package com.autotune.analyzer.services;

+import com.autotune.analyzer.adapters.DeviceDetailsAdapter;
+import com.autotune.analyzer.adapters.RecommendationItemAdapter;
 import com.autotune.analyzer.exceptions.InvalidValueException;
 import com.autotune.analyzer.exceptions.PerformanceProfileResponse;
 import com.autotune.analyzer.performanceProfiles.MetricProfileCollection;
@@ -28,6 +30,7 @@
 import com.autotune.common.data.ValidationOutputData;
 import com.autotune.common.data.metrics.Metric;
 import com.autotune.common.data.result.ContainerData;
+import com.autotune.common.data.system.info.device.DeviceDetails;
 import com.autotune.database.dao.ExperimentDAOImpl;
 import com.autotune.database.service.ExperimentDBService;
 import com.autotune.utils.KruizeConstants;
@@ -378,6 +381,8 @@ private Gson createGsonObject() {
                 .setPrettyPrinting()
                 .enableComplexMapKeySerialization()
                 .registerTypeAdapter(Date.class, new GsonUTCDateAdapter())
+                .registerTypeAdapter(AnalyzerConstants.RecommendationItem.class, new RecommendationItemAdapter())
+                .registerTypeAdapter(DeviceDetails.class, new DeviceDetailsAdapter())
                 // a custom serializer for serializing metadata of JsonNode type.
                 .registerTypeAdapter(JsonNode.class, new JsonSerializer<JsonNode>() {
                     @Override
diff --git a/src/main/java/com/autotune/analyzer/services/PerformanceProfileService.java b/src/main/java/com/autotune/analyzer/services/PerformanceProfileService.java
index 71be6267e..43cc8588f 100644
--- a/src/main/java/com/autotune/analyzer/services/PerformanceProfileService.java
+++ b/src/main/java/com/autotune/analyzer/services/PerformanceProfileService.java
@@ -16,6 +16,8 @@
 package com.autotune.analyzer.services;

+import com.autotune.analyzer.adapters.DeviceDetailsAdapter;
+import com.autotune.analyzer.adapters.RecommendationItemAdapter;
 import com.autotune.analyzer.exceptions.InvalidValueException;
 import com.autotune.analyzer.exceptions.PerformanceProfileResponse;
 import com.autotune.analyzer.performanceProfiles.PerformanceProfile;
@@ -26,6 +28,7 @@
 import com.autotune.analyzer.utils.GsonUTCDateAdapter;
 import com.autotune.common.data.ValidationOutputData;
 import com.autotune.common.data.metrics.Metric;
+import com.autotune.common.data.system.info.device.DeviceDetails;
 import com.autotune.database.service.ExperimentDBService;
 import com.google.gson.ExclusionStrategy;
 import com.google.gson.FieldAttributes;
@@ -130,6 +133,8 @@ protected void doGet(HttpServletRequest req, HttpServletResponse response) throw
                 .setPrettyPrinting()
                 .enableComplexMapKeySerialization()
                 .registerTypeAdapter(Date.class, new GsonUTCDateAdapter())
+                .registerTypeAdapter(AnalyzerConstants.RecommendationItem.class, new RecommendationItemAdapter())
+                .registerTypeAdapter(DeviceDetails.class, new DeviceDetailsAdapter())
                 .setExclusionStrategies(new ExclusionStrategy() {
                     @Override
                     public boolean shouldSkipField(FieldAttributes f) {
diff --git a/src/main/java/com/autotune/analyzer/services/UpdateRecommendations.java b/src/main/java/com/autotune/analyzer/services/UpdateRecommendations.java
index 903378655..e558d1d37 100644
--- a/src/main/java/com/autotune/analyzer/services/UpdateRecommendations.java
+++ b/src/main/java/com/autotune/analyzer/services/UpdateRecommendations.java
@@ -15,15 +15,19 @@
  *******************************************************************************/
 package com.autotune.analyzer.services;

+import com.autotune.analyzer.adapters.DeviceDetailsAdapter;
+import com.autotune.analyzer.adapters.RecommendationItemAdapter;
 import com.autotune.analyzer.exceptions.FetchMetricsError;
 import com.autotune.analyzer.kruizeObject.KruizeObject;
 import com.autotune.analyzer.recommendations.engine.RecommendationEngine;
 import com.autotune.analyzer.serviceObjects.ContainerAPIObject;
 import com.autotune.analyzer.serviceObjects.Converters;
 import com.autotune.analyzer.serviceObjects.ListRecommendationsAPIObject;
+import com.autotune.analyzer.utils.AnalyzerConstants;
 import com.autotune.analyzer.utils.AnalyzerErrorConstants;
 import com.autotune.analyzer.utils.GsonUTCDateAdapter;
 import com.autotune.common.data.result.ContainerData;
+import com.autotune.common.data.system.info.device.DeviceDetails;
 import com.autotune.operator.KruizeDeploymentInfo;
 import com.autotune.utils.KruizeConstants;
 import com.autotune.utils.MetricsConfig;
@@ -168,6 +172,8 @@ public boolean shouldSkipClass(Class clazz) {
                     .setPrettyPrinting()
                     .enableComplexMapKeySerialization()
                     .registerTypeAdapter(Date.class, new GsonUTCDateAdapter())
+                    .registerTypeAdapter(AnalyzerConstants.RecommendationItem.class, new RecommendationItemAdapter())
+                    .registerTypeAdapter(DeviceDetails.class, new DeviceDetailsAdapter())
                     .setExclusionStrategies(strategy)
                     .create();
gsonObj.toJson(recommendationList); diff --git a/src/main/java/com/autotune/analyzer/services/UpdateResults.java b/src/main/java/com/autotune/analyzer/services/UpdateResults.java index 7ae38192e..a5d8bbd79 100644 --- a/src/main/java/com/autotune/analyzer/services/UpdateResults.java +++ b/src/main/java/com/autotune/analyzer/services/UpdateResults.java @@ -16,6 +16,8 @@ package com.autotune.analyzer.services; +import com.autotune.analyzer.adapters.DeviceDetailsAdapter; +import com.autotune.analyzer.adapters.RecommendationItemAdapter; import com.autotune.analyzer.exceptions.KruizeResponse; import com.autotune.analyzer.experiment.ExperimentInitiator; import com.autotune.analyzer.performanceProfiles.PerformanceProfile; @@ -23,6 +25,7 @@ import com.autotune.analyzer.serviceObjects.UpdateResultsAPIObject; import com.autotune.analyzer.utils.AnalyzerConstants; import com.autotune.analyzer.utils.AnalyzerErrorConstants; +import com.autotune.common.data.system.info.device.DeviceDetails; import com.autotune.operator.KruizeDeploymentInfo; import com.autotune.utils.MetricsConfig; import com.google.gson.*; @@ -78,6 +81,8 @@ protected void doPost(HttpServletRequest request, HttpServletResponse response) Gson gson = new GsonBuilder() .registerTypeAdapter(Double.class, new CustomNumberDeserializer()) .registerTypeAdapter(Integer.class, new CustomNumberDeserializer()) + .registerTypeAdapter(AnalyzerConstants.RecommendationItem.class, new RecommendationItemAdapter()) + .registerTypeAdapter(DeviceDetails.class, new DeviceDetailsAdapter()) .create(); LOGGER.debug("updateResults API request payload for requestID {} is {}", calCount, inputData); try { diff --git a/src/main/java/com/autotune/analyzer/utils/AnalyzerConstants.java b/src/main/java/com/autotune/analyzer/utils/AnalyzerConstants.java index 4d6b1460a..740bb859a 100644 --- a/src/main/java/com/autotune/analyzer/utils/AnalyzerConstants.java +++ b/src/main/java/com/autotune/analyzer/utils/AnalyzerConstants.java @@ -119,8 +119,31 @@ public enum ExperimentStatus { } public enum RecommendationItem { - cpu, - memory + CPU("cpu"), + MEMORY("memory"), + NVIDIA_GPU("nvidia.com/gpu"), + NVIDIA_GPU_PARTITION_1_CORE_5GB("nvidia.com/mig-1g.5gb"), + NVIDIA_GPU_PARTITION_1_CORE_10GB("nvidia.com/mig-1g.10gb"), + NVIDIA_GPU_PARTITION_1_CORE_20GB("nvidia.com/mig-1g.20gb"), + NVIDIA_GPU_PARTITION_2_CORES_20GB("nvidia.com/mig-2g.20gb"), + NVIDIA_GPU_PARTITION_3_CORES_40GB("nvidia.com/mig-3g.40gb"), + NVIDIA_GPU_PARTITION_4_CORES_40GB("nvidia.com/mig-4g.40gb"), + NVIDIA_GPU_PARTITION_7_CORES_80GB("nvidia.com/mig-7g.80gb"), + NVIDIA_GPU_PARTITION_2_CORES_10GB("nvidia.com/mig-2g.10gb"), + NVIDIA_GPU_PARTITION_3_CORES_20GB("nvidia.com/mig-3g.20gb"), + NVIDIA_GPU_PARTITION_4_CORES_20GB("nvidia.com/mig-4g.20gb"), + NVIDIA_GPU_PARTITION_7_CORES_40GB("nvidia.com/mig-7g.40gb"); + + private final String value; + + RecommendationItem(String value) { + this.value = value; + } + + @Override + public String toString() { + return value; + } } public enum CapacityMax { @@ -196,6 +219,66 @@ public enum RegisterRecommendationModelStatus { INVALID } + public enum DeviceType { + CPU, + MEMORY, + NETWORK, + ACCELERATOR + } + + public enum DeviceParameters { + MODEL_NAME, + UUID, + HOSTNAME, + NAME, + MANUFACTURER, + DEVICE_NAME + } + + public static final class AcceleratorConstants { + private AcceleratorConstants() { + + } + + public static final class AcceleratorMetricConstants { + private AcceleratorMetricConstants() { + + } + + public static final int TIMESTAMP_RANGE_CHECK_IN_MINUTES = 5; + } 
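Reviewer note: the `RecommendationItem` enum above now carries Kubernetes resource strings ("cpu", "nvidia.com/gpu", the MIG partition names) that differ from the Java constant names, so Gson's default enum handling (which round-trips on `name()`) would no longer produce the right JSON. That is why every `GsonBuilder` touched by this patch registers a `RecommendationItemAdapter`. The adapter itself is not part of these hunks; below is a minimal sketch of what it plausibly looks like, keyed off the overridden `toString()` (the actual class in `com.autotune.analyzer.adapters` may differ):

```java
// Sketch only; the real adapter lives in com.autotune.analyzer.adapters.
import com.autotune.analyzer.utils.AnalyzerConstants;
import com.google.gson.TypeAdapter;
import com.google.gson.stream.JsonReader;
import com.google.gson.stream.JsonWriter;

import java.io.IOException;

public class RecommendationItemAdapter extends TypeAdapter<AnalyzerConstants.RecommendationItem> {
    @Override
    public void write(JsonWriter out, AnalyzerConstants.RecommendationItem item) throws IOException {
        if (item == null) {
            out.nullValue();
            return;
        }
        // Emit the resource string ("nvidia.com/mig-1g.5gb"), not the
        // constant name (NVIDIA_GPU_PARTITION_1_CORE_5GB) Gson would default to.
        out.value(item.toString());
    }

    @Override
    public AnalyzerConstants.RecommendationItem read(JsonReader in) throws IOException {
        String value = in.nextString();
        for (AnalyzerConstants.RecommendationItem item : AnalyzerConstants.RecommendationItem.values()) {
            if (item.toString().equals(value)) {
                return item;
            }
        }
        throw new IOException("Unknown recommendation item: " + value);
    }
}
```

It would then be registered exactly as in the hunks above: `.registerTypeAdapter(AnalyzerConstants.RecommendationItem.class, new RecommendationItemAdapter())`.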
+ + public static final class SupportedAccelerators { + private SupportedAccelerators() { + + } + public static final String A100_80_GB = "A100-80GB"; + public static final String A100_40_GB = "A100-40GB"; + public static final String H100_80_GB = "H100-80GB"; + } + + public static final class AcceleratorProfiles { + private AcceleratorProfiles () { + + } + + // A100 40GB Profiles + public static final String PROFILE_1G_5GB = "1g.5gb"; + public static final String PROFILE_1G_10GB = "1g.10gb"; + public static final String PROFILE_2G_10GB = "2g.10gb"; + public static final String PROFILE_3G_20GB = "3g.20gb"; + public static final String PROFILE_4G_20GB = "4g.20gb"; + public static final String PROFILE_7G_40GB = "7g.40gb"; + + // A100 80GB & H100 80GB Profiles + public static final String PROFILE_1G_20GB = "1g.20gb"; + public static final String PROFILE_2G_20GB = "2g.20gb"; + public static final String PROFILE_3G_40GB = "3g.40gb"; + public static final String PROFILE_4G_40GB = "4g.40gb"; + public static final String PROFILE_7G_80GB = "7g.80gb"; + } + } + public static final class ExperimentTypes { public static final String NAMESPACE_EXPERIMENT = "namespace"; public static final String CONTAINER_EXPERIMENT = "container"; diff --git a/src/main/java/com/autotune/analyzer/workerimpl/BulkJobManager.java b/src/main/java/com/autotune/analyzer/workerimpl/BulkJobManager.java new file mode 100644 index 000000000..c827fd289 --- /dev/null +++ b/src/main/java/com/autotune/analyzer/workerimpl/BulkJobManager.java @@ -0,0 +1,303 @@ +/******************************************************************************* + * Copyright (c) 2020, 2021 Red Hat, IBM Corporation and others. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ *******************************************************************************/ +package com.autotune.analyzer.workerimpl; + + +import com.autotune.analyzer.kruizeObject.KruizeObject; +import com.autotune.analyzer.kruizeObject.RecommendationSettings; +import com.autotune.analyzer.serviceObjects.*; +import com.autotune.analyzer.utils.AnalyzerConstants; +import com.autotune.common.data.ValidationOutputData; +import com.autotune.common.data.dataSourceMetadata.*; +import com.autotune.common.datasource.DataSourceInfo; +import com.autotune.common.datasource.DataSourceManager; +import com.autotune.common.k8sObjects.TrialSettings; +import com.autotune.common.utils.CommonUtils; +import com.autotune.database.service.ExperimentDBService; +import com.autotune.operator.KruizeDeploymentInfo; +import com.autotune.utils.KruizeConstants; +import com.autotune.utils.Utils; +import org.json.JSONObject; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.io.IOException; +import java.net.HttpURLConnection; +import java.net.URL; +import java.sql.Timestamp; +import java.time.Instant; +import java.time.LocalDateTime; +import java.time.ZoneOffset; +import java.time.format.DateTimeFormatter; +import java.util.*; +import java.util.concurrent.ExecutorService; +import java.util.concurrent.Executors; + +import static com.autotune.operator.KruizeDeploymentInfo.bulk_thread_pool_size; +import static com.autotune.utils.KruizeConstants.KRUIZE_BULK_API.*; + + +/** + * The `run` method processes bulk input to create experiments and generates resource optimization recommendations. + * It handles the creation of experiment names based on various data source components, makes HTTP POST requests + * to generate recommendations, and updates job statuses based on the progress of the recommendations. + * + *
+ * <p>
+ * Key operations include:
+ * <ul>
+ *     <li>Processing 'include' filter labels to generate a unique key.</li>
+ *     <li>Validating and setting the data source if not provided in the input.</li>
+ *     <li>Extracting time range from the input and converting it to epoch time format.</li>
+ *     <li>Fetching metadata information from the data source for the specified time range and labels.</li>
+ *     <li>Creating experiments for each data source component such as clusters, namespaces, workloads, and containers.</li>
+ *     <li>Submitting HTTP POST requests to retrieve recommendations for each created experiment.</li>
+ *     <li>Updating the job status and progress based on the completion of recommendations.</li>
+ * </ul>
+ * </p>
+ *
+ * <p>
+ * In case of an exception during the process, error messages are logged, and the exception is printed for debugging.
+ * </p>
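+ *
+ * <p>
+ * Wiring sketch (an assumption for illustration; the submitting service is not part of
+ * this hunk). The constructor below takes the job id, the shared status map, and the
+ * parsed {@code BulkInput} payload:
+ * <pre>{@code
+ * Map<String, BulkJobStatus> jobStatusMap = new ConcurrentHashMap<>();
+ * String jobID = UUID.randomUUID().toString();
+ * // jobStatus and bulkInput construction omitted; hypothetical wiring only.
+ * jobStatusMap.put(jobID, jobStatus);
+ * Executors.newSingleThreadExecutor().submit(new BulkJobManager(jobID, jobStatusMap, bulkInput));
+ * }</pre>
+ * </p>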
+ * + * @throws RuntimeException if URL or HTTP connection setup fails. + * @throws IOException if an error occurs while sending HTTP requests. + */ +public class BulkJobManager implements Runnable { + private static final Logger LOGGER = LoggerFactory.getLogger(BulkJobManager.class); + + private String jobID; + private Map jobStatusMap; + private BulkInput bulkInput; + + + public BulkJobManager(String jobID, Map jobStatusMap, BulkInput payload) { + this.jobID = jobID; + this.jobStatusMap = jobStatusMap; + this.bulkInput = payload; + } + + public static List appendExperiments(List allExperiments, String experimentName) { + allExperiments.add(experimentName); + return allExperiments; + } + + @Override + public void run() { + try { + BulkJobStatus jobData = jobStatusMap.get(jobID); + String uniqueKey = getLabels(this.bulkInput.getFilter()); + if (null == this.bulkInput.getDatasource()) { + this.bulkInput.setDatasource(CREATE_EXPERIMENT_CONFIG_BEAN.getDatasourceName()); + } + DataSourceMetadataInfo metadataInfo = null; + DataSourceManager dataSourceManager = new DataSourceManager(); + DataSourceInfo datasource = CommonUtils.getDataSourceInfo(this.bulkInput.getDatasource()); + JSONObject daterange = processDateRange(this.bulkInput.getTime_range()); + if (null != daterange) + metadataInfo = dataSourceManager.importMetadataFromDataSource(datasource, uniqueKey, (Long) daterange.get("start_time"), (Long) daterange.get("end_time"), (Integer) daterange.get("steps")); + else { + metadataInfo = dataSourceManager.importMetadataFromDataSource(datasource, uniqueKey, 0, 0, 0); + } + if (null == metadataInfo) { + jobData.setStatus(COMPLETED); + jobData.setMessage(NOTHING); + } else { + Map createExperimentAPIObjectMap = getExperimentMap(metadataInfo); //Todo Store this map in buffer and use it if BulkAPI pods restarts and support experiment_type + jobData.setTotal_experiments(createExperimentAPIObjectMap.size()); + jobData.setProcessed_experiments(0); + if (jobData.getTotal_experiments() > KruizeDeploymentInfo.BULK_API_LIMIT) { + jobStatusMap.get(jobID).setStatus(FAILED); + jobStatusMap.get(jobID).setMessage(String.format(LIMIT_MESSAGE, KruizeDeploymentInfo.BULK_API_LIMIT)); + } else { + ExecutorService createExecutor = Executors.newFixedThreadPool(bulk_thread_pool_size); + ExecutorService generateExecutor = Executors.newFixedThreadPool(bulk_thread_pool_size); + for (CreateExperimentAPIObject apiObject : createExperimentAPIObjectMap.values()) { + createExecutor.submit(() -> { + String experiment_name = apiObject.getExperimentName(); + BulkJobStatus.Experiments newExperiments = jobData.getData().getExperiments(); + BulkJobStatus.RecommendationData recommendationData = jobData.getData().getRecommendations().getData(); + try { + ValidationOutputData output = new ExperimentDBService().addExperimentToDB(apiObject); + if (output.isSuccess()) { + jobData.getData().getExperiments().setNewExperiments( + appendExperiments(newExperiments.getNewExperiments(), experiment_name) + ); + } + generateExecutor.submit(() -> { + + jobData.getData().getRecommendations().getData().setUnprocessed( + appendExperiments(recommendationData.getUnprocessed(), experiment_name) + ); + + URL url = null; + HttpURLConnection connection = null; + int statusCode = 0; + try { + url = new URL(String.format(KruizeDeploymentInfo.recommendations_url, experiment_name)); + connection = (HttpURLConnection) url.openConnection(); + connection.setRequestMethod("POST"); + + recommendationData.moveToProgress(experiment_name); + + statusCode = 
connection.getResponseCode(); + } catch (IOException e) { + LOGGER.error(e.getMessage()); + + recommendationData.moveToFailed(experiment_name); + + throw new RuntimeException(e); + } finally { + if (null != connection) connection.disconnect(); + } + if (statusCode == HttpURLConnection.HTTP_CREATED) { + + recommendationData.moveToCompleted(experiment_name); + jobData.setProcessed_experiments(jobData.getProcessed_experiments() + 1); + + if (jobData.getTotal_experiments() == jobData.getProcessed_experiments()) { + jobData.setStatus(COMPLETED); + jobStatusMap.get(jobID).setEndTime(Instant.now()); + } + + } else { + + recommendationData.moveToFailed(experiment_name); + + } + }); + } catch (Exception e) { + e.printStackTrace(); + recommendationData.moveToFailed(experiment_name); + } + }); + } + } + } + } catch (Exception e) { + LOGGER.error(e.getMessage()); + e.printStackTrace(); + jobStatusMap.get(jobID).setStatus("FAILED"); + } + } + + + Map getExperimentMap(DataSourceMetadataInfo metadataInfo) { + Map createExperimentAPIObjectMap = new HashMap<>(); + Collection dataSourceCollection = metadataInfo.getDataSourceHashMap().values(); + for (DataSource ds : dataSourceCollection) { + HashMap clusterHashMap = ds.getDataSourceClusterHashMap(); + for (DataSourceCluster dsc : clusterHashMap.values()) { + HashMap namespaceHashMap = dsc.getDataSourceNamespaceHashMap(); + for (DataSourceNamespace namespace : namespaceHashMap.values()) { + HashMap dataSourceWorkloadHashMap = namespace.getDataSourceWorkloadHashMap(); + if (dataSourceWorkloadHashMap != null) { + for (DataSourceWorkload dsw : dataSourceWorkloadHashMap.values()) { + HashMap dataSourceContainerHashMap = dsw.getDataSourceContainerHashMap(); + if (dataSourceContainerHashMap != null) { + for (DataSourceContainer dc : dataSourceContainerHashMap.values()) { + CreateExperimentAPIObject createExperimentAPIObject = new CreateExperimentAPIObject(); + createExperimentAPIObject.setMode(CREATE_EXPERIMENT_CONFIG_BEAN.getMode()); + createExperimentAPIObject.setTargetCluster(CREATE_EXPERIMENT_CONFIG_BEAN.getTarget()); + createExperimentAPIObject.setApiVersion(CREATE_EXPERIMENT_CONFIG_BEAN.getVersion()); + String experiment_name = this.bulkInput.getDatasource() + "|" + dsc.getDataSourceClusterName() + "|" + namespace.getDataSourceNamespaceName() + + "|" + dsw.getDataSourceWorkloadName() + "(" + dsw.getDataSourceWorkloadType() + ")" + "|" + dc.getDataSourceContainerName(); + createExperimentAPIObject.setExperimentName(experiment_name); + createExperimentAPIObject.setDatasource(this.bulkInput.getDatasource()); + createExperimentAPIObject.setClusterName(dsc.getDataSourceClusterName()); + createExperimentAPIObject.setPerformanceProfile(CREATE_EXPERIMENT_CONFIG_BEAN.getPerformanceProfile()); + List kubernetesAPIObjectList = new ArrayList<>(); + KubernetesAPIObject kubernetesAPIObject = new KubernetesAPIObject(); + ContainerAPIObject cao = new ContainerAPIObject(dc.getDataSourceContainerName(), + dc.getDataSourceContainerImageName(), null, null); + kubernetesAPIObject.setContainerAPIObjects(Arrays.asList(cao)); + kubernetesAPIObject.setName(dsw.getDataSourceWorkloadName()); + kubernetesAPIObject.setType(dsw.getDataSourceWorkloadType()); + kubernetesAPIObject.setNamespace(namespace.getDataSourceNamespaceName()); + kubernetesAPIObjectList.add(kubernetesAPIObject); + createExperimentAPIObject.setKubernetesObjects(kubernetesAPIObjectList); + RecommendationSettings rs = new RecommendationSettings(); + rs.setThreshold(CREATE_EXPERIMENT_CONFIG_BEAN.getThreshold()); + 
createExperimentAPIObject.setRecommendationSettings(rs); + TrialSettings trialSettings = new TrialSettings(); + trialSettings.setMeasurement_durationMinutes(CREATE_EXPERIMENT_CONFIG_BEAN.getMeasurementDurationStr()); + createExperimentAPIObject.setTrialSettings(trialSettings); + List kruizeExpList = new ArrayList<>(); + + createExperimentAPIObject.setExperiment_id(Utils.generateID(createExperimentAPIObject.toString())); + createExperimentAPIObject.setStatus(AnalyzerConstants.ExperimentStatus.IN_PROGRESS); + createExperimentAPIObject.setExperimentType(AnalyzerConstants.ExperimentTypes.CONTAINER_EXPERIMENT); + createExperimentAPIObjectMap.put(experiment_name, createExperimentAPIObject); + } + } + } + } + } + } + } + return createExperimentAPIObjectMap; + } + + private String getLabels(BulkInput.FilterWrapper filter) { + String uniqueKey = null; + try { + // Process labels in the 'include' section + if (filter != null && filter.getInclude() != null) { + // Initialize StringBuilder for uniqueKey + StringBuilder includeLabelsBuilder = new StringBuilder(); + Map includeLabels = filter.getInclude().getLabels(); + if (includeLabels != null && !includeLabels.isEmpty()) { + includeLabels.forEach((key, value) -> + includeLabelsBuilder.append(key).append("=").append("\"" + value + "\"").append(",") + ); + // Remove trailing comma + if (includeLabelsBuilder.length() > 0) { + includeLabelsBuilder.setLength(includeLabelsBuilder.length() - 1); + } + LOGGER.debug("Include Labels: " + includeLabelsBuilder.toString()); + uniqueKey = includeLabelsBuilder.toString(); + } + } + } catch (Exception e) { + e.printStackTrace(); + LOGGER.error(e.getMessage()); + } + return uniqueKey; + } + + private JSONObject processDateRange(BulkInput.TimeRange timeRange) { + JSONObject dateRange = null; + if (null != timeRange && timeRange.getStart() != null && timeRange.getEnd() != null) { + String intervalEndTimeStr = timeRange.getStart(); + String intervalStartTimeStr = timeRange.getEnd(); + long interval_end_time_epoc = 0; + long interval_start_time_epoc = 0; + LocalDateTime localDateTime = LocalDateTime.parse(intervalEndTimeStr, DateTimeFormatter.ofPattern(KruizeConstants.DateFormats.STANDARD_JSON_DATE_FORMAT)); + interval_end_time_epoc = localDateTime.toEpochSecond(ZoneOffset.UTC); + Timestamp interval_end_time = Timestamp.from(localDateTime.toInstant(ZoneOffset.UTC)); + localDateTime = LocalDateTime.parse(intervalStartTimeStr, DateTimeFormatter.ofPattern(KruizeConstants.DateFormats.STANDARD_JSON_DATE_FORMAT)); + interval_start_time_epoc = localDateTime.toEpochSecond(ZoneOffset.UTC); + Timestamp interval_start_time = Timestamp.from(localDateTime.toInstant(ZoneOffset.UTC)); + int steps = CREATE_EXPERIMENT_CONFIG_BEAN.getMeasurementDuration() * KruizeConstants.TimeConv.NO_OF_SECONDS_PER_MINUTE; // todo fetch experiment recommendations setting measurement + dateRange = new JSONObject(); + dateRange.put("start_time", interval_start_time_epoc); + dateRange.put("end_time", interval_end_time_epoc); + dateRange.put("steps", steps); + } + return dateRange; + } + + +} diff --git a/src/main/java/com/autotune/common/data/dataSourceQueries/DataSourceQueries.java b/src/main/java/com/autotune/common/data/dataSourceQueries/DataSourceQueries.java index ccf20f8c6..dbddbe7d4 100644 --- a/src/main/java/com/autotune/common/data/dataSourceQueries/DataSourceQueries.java +++ b/src/main/java/com/autotune/common/data/dataSourceQueries/DataSourceQueries.java @@ -7,9 +7,10 @@ */ public class DataSourceQueries { public enum PromQLQuery { - 
NAMESPACE_QUERY("sum by (namespace) ( avg_over_time(kube_namespace_status_phase{namespace!=\"\"}[15d]))"), - WORKLOAD_INFO_QUERY("sum by (namespace, workload, workload_type) ( avg_over_time(namespace_workload_pod:kube_pod_owner:relabel{workload!=\"\"}[15d]))"), - CONTAINER_INFO_QUERY("sum by (container, image, workload, workload_type, namespace) ( avg_over_time(kube_pod_container_info{container!=\"\"}[15d]) * on (pod, namespace) group_left(workload, workload_type) avg_over_time(namespace_workload_pod:kube_pod_owner:relabel{workload!=\"\"}[15d]))"); + NAMESPACE_QUERY("sum by (namespace) ( avg_over_time(kube_namespace_status_phase{namespace!=\"\" ADDITIONAL_LABEL}[15d]))"), + WORKLOAD_INFO_QUERY("sum by (namespace, workload, workload_type) ( avg_over_time(namespace_workload_pod:kube_pod_owner:relabel{workload!=\"\" ADDITIONAL_LABEL}[15d]))"), + CONTAINER_INFO_QUERY("sum by (container, image, workload, workload_type, namespace) ( avg_over_time(kube_pod_container_info{container!=\"\" ADDITIONAL_LABEL }[15d]) * on (pod, namespace) group_left(workload, workload_type) avg_over_time(namespace_workload_pod:kube_pod_owner:relabel{workload!=\"\" ADDITIONAL_LABEL}[15d]))"); + private final String query; PromQLQuery(String query) { diff --git a/src/main/java/com/autotune/common/data/metrics/AcceleratorMetricResult.java b/src/main/java/com/autotune/common/data/metrics/AcceleratorMetricResult.java new file mode 100644 index 000000000..01f570ecb --- /dev/null +++ b/src/main/java/com/autotune/common/data/metrics/AcceleratorMetricResult.java @@ -0,0 +1,29 @@ +package com.autotune.common.data.metrics; + +import com.autotune.common.data.system.info.device.accelerator.AcceleratorDeviceData; + +public class AcceleratorMetricResult { + private AcceleratorDeviceData acceleratorDeviceData; + private MetricResults metricResults; + + public AcceleratorMetricResult(AcceleratorDeviceData acceleratorDeviceData, MetricResults metricResults) { + this.acceleratorDeviceData = acceleratorDeviceData; + this.metricResults = metricResults; + } + + public AcceleratorDeviceData getAcceleratorDeviceData() { + return acceleratorDeviceData; + } + + public void setAcceleratorDeviceData(AcceleratorDeviceData acceleratorDeviceData) { + this.acceleratorDeviceData = acceleratorDeviceData; + } + + public MetricResults getMetricResults() { + return metricResults; + } + + public void setMetricResults(MetricResults metricResults) { + this.metricResults = metricResults; + } +} diff --git a/src/main/java/com/autotune/common/data/result/ContainerData.java b/src/main/java/com/autotune/common/data/result/ContainerData.java index 4f7afcc7f..66aa1dfc5 100644 --- a/src/main/java/com/autotune/common/data/result/ContainerData.java +++ b/src/main/java/com/autotune/common/data/result/ContainerData.java @@ -18,6 +18,7 @@ import com.autotune.analyzer.recommendations.ContainerRecommendations; import com.autotune.analyzer.utils.AnalyzerConstants; import com.autotune.common.data.metrics.Metric; +import com.autotune.common.data.system.info.device.ContainerDeviceList; import com.autotune.utils.KruizeConstants; import com.google.gson.annotations.SerializedName; @@ -29,6 +30,7 @@ public class ContainerData { private String container_name; //key is intervalEndTime private HashMap results; + private ContainerDeviceList containerDeviceList; @SerializedName(KruizeConstants.JSONKeys.RECOMMENDATIONS) private ContainerRecommendations containerRecommendations; private HashMap metrics; @@ -85,6 +87,14 @@ public HashMap getMetrics() { public void setMetrics(HashMap 
metrics) { this.metrics = metrics; } + + public ContainerDeviceList getContainerDeviceList() { + return containerDeviceList; + } + + public void setContainerDeviceList(ContainerDeviceList containerDeviceList) { + this.containerDeviceList = containerDeviceList; + } @Override public String toString() { return "ContainerData{" + diff --git a/src/main/java/com/autotune/common/data/result/IntervalResults.java b/src/main/java/com/autotune/common/data/result/IntervalResults.java index e9bd880f3..327681690 100644 --- a/src/main/java/com/autotune/common/data/result/IntervalResults.java +++ b/src/main/java/com/autotune/common/data/result/IntervalResults.java @@ -16,6 +16,7 @@ package com.autotune.common.data.result; import com.autotune.analyzer.utils.AnalyzerConstants; +import com.autotune.common.data.metrics.AcceleratorMetricResult; import com.autotune.common.data.metrics.MetricResults; import com.google.gson.annotations.SerializedName; @@ -32,6 +33,7 @@ public class IntervalResults { @SerializedName(METRICS) HashMap metricResultsMap; + HashMap acceleratorMetricResultHashMap; @SerializedName(INTERVAL_START_TIME) private Timestamp intervalStartTime; @SerializedName(INTERVAL_END_TIME) @@ -85,6 +87,14 @@ public void setDurationInMinutes(Double durationInMinutes) { this.durationInMinutes = durationInMinutes; } + public HashMap getAcceleratorMetricResultHashMap() { + return acceleratorMetricResultHashMap; + } + + public void setAcceleratorMetricResultHashMap(HashMap acceleratorMetricResultHashMap) { + this.acceleratorMetricResultHashMap = acceleratorMetricResultHashMap; + } + @Override public String toString() { return "IntervalResults{" + diff --git a/src/main/java/com/autotune/common/data/system/info/device/ContainerDeviceList.java b/src/main/java/com/autotune/common/data/system/info/device/ContainerDeviceList.java new file mode 100644 index 000000000..00de9e322 --- /dev/null +++ b/src/main/java/com/autotune/common/data/system/info/device/ContainerDeviceList.java @@ -0,0 +1,144 @@ +package com.autotune.common.data.system.info.device; + +import com.autotune.analyzer.utils.AnalyzerConstants; +import com.autotune.common.data.system.info.device.accelerator.AcceleratorDeviceData; + +import java.util.ArrayList; +import java.util.HashMap; + +/** + * This class stores the device entries linked to the container + */ +public class ContainerDeviceList implements DeviceHandler, DeviceComponentDetector { + private final HashMap> deviceMap; + private boolean isAcceleratorDeviceDetected; + private boolean isCPUDeviceDetected; + private boolean isMemoryDeviceDetected; + private boolean isNetworkDeviceDetected; + + public ContainerDeviceList(){ + this.deviceMap = new HashMap>(); + this.isAcceleratorDeviceDetected = false; + // Currently setting up CPU, Memory and Network as true by default + this.isCPUDeviceDetected = true; + this.isMemoryDeviceDetected = true; + this.isNetworkDeviceDetected = true; + } + + @Override + public void addDevice(AnalyzerConstants.DeviceType deviceType, DeviceDetails deviceInfo) { + if (null == deviceType || null == deviceInfo) { + // TODO: Handle appropriate returns in future + return; + } + + if (deviceType == AnalyzerConstants.DeviceType.ACCELERATOR) + this.isAcceleratorDeviceDetected = true; + + // TODO: Handle multiple same entries + // Currently only first MIG is getting added so no check for existing duplicates is done + if (null == deviceMap.get(deviceType)) { + ArrayList deviceDetailsList = new ArrayList(); + deviceDetailsList.add(deviceInfo); + this.deviceMap.put(deviceType, 
deviceDetailsList); + } else { + this.deviceMap.get(deviceType).add(deviceInfo); + } + } + + @Override + public void removeDevice(AnalyzerConstants.DeviceType deviceType, DeviceDetails deviceInfo) { + if (null == deviceType || null == deviceInfo) { + // TODO: Handle appropriate returns in future + return; + } + // TODO: Need to be implemented if we need a dynamic experiment device updates + if (deviceType == AnalyzerConstants.DeviceType.ACCELERATOR) { + if (null == deviceMap.get(deviceType) || this.deviceMap.get(deviceType).isEmpty()) { + this.isAcceleratorDeviceDetected = false; + } + } + } + + @Override + public void updateDevice(AnalyzerConstants.DeviceType deviceType, DeviceDetails deviceInfo) { + // TODO: Need to be implemented if we need a dynamic experiment device updates + } + + /** + * Returns the Device which matches the identifier based on the device parameter passed + * @param deviceType - Type of the device Eg: CPU, Memory, Network or Accelerator + * @param matchIdentifier - String which needs to the matched + * @param deviceParameters - Parameter to search in device details list + * @return the appropriate DeviceDetails object + * + * USE CASE: To search the device based on a particular parameter, Let's say you have multiple accelerators + * to the container, you can pass the Model name as parameter and name of model to get the particular + * DeviceDetail object. + */ + @Override + public DeviceDetails getDeviceByParameter(AnalyzerConstants.DeviceType deviceType, String matchIdentifier, AnalyzerConstants.DeviceParameters deviceParameters) { + if (null == deviceType) + return null; + if (null == matchIdentifier) + return null; + if (null == deviceParameters) + return null; + if (matchIdentifier.isEmpty()) + return null; + if (!deviceMap.containsKey(deviceType)) + return null; + if (null == deviceMap.get(deviceType)) + return null; + if (deviceMap.get(deviceType).isEmpty()) + return null; + + // Todo: Need to add extractors for each device type currently implementing for GPU + if (deviceType == AnalyzerConstants.DeviceType.ACCELERATOR) { + for (DeviceDetails deviceDetails: deviceMap.get(deviceType)) { + AcceleratorDeviceData deviceData = (AcceleratorDeviceData) deviceDetails; + if (deviceParameters == AnalyzerConstants.DeviceParameters.MODEL_NAME) { + if (deviceData.getModelName().equalsIgnoreCase(matchIdentifier)) { + return deviceData; + } + } + } + } + + return null; + } + + @Override + public ArrayList getDevices(AnalyzerConstants.DeviceType deviceType) { + if (null == deviceType) + return null; + if (!deviceMap.containsKey(deviceType)) + return null; + if (null == deviceMap.get(deviceType)) + return null; + if (deviceMap.get(deviceType).isEmpty()) + return null; + + return deviceMap.get(deviceType); + } + + @Override + public boolean isAcceleratorDeviceDetected() { + return this.isAcceleratorDeviceDetected; + } + + @Override + public boolean isCPUDeviceDetected() { + return this.isCPUDeviceDetected; + } + + @Override + public boolean isMemoryDeviceDetected() { + return this.isMemoryDeviceDetected; + } + + @Override + public boolean isNetworkDeviceDetected() { + return this.isNetworkDeviceDetected; + } +} diff --git a/src/main/java/com/autotune/common/data/system/info/device/DeviceComponentDetector.java b/src/main/java/com/autotune/common/data/system/info/device/DeviceComponentDetector.java new file mode 100644 index 000000000..249ba9c55 --- /dev/null +++ b/src/main/java/com/autotune/common/data/system/info/device/DeviceComponentDetector.java @@ -0,0 +1,8 @@ +package 
com.autotune.common.data.system.info.device; + +public interface DeviceComponentDetector { + public boolean isAcceleratorDeviceDetected(); + public boolean isCPUDeviceDetected(); + public boolean isMemoryDeviceDetected(); + public boolean isNetworkDeviceDetected(); +} diff --git a/src/main/java/com/autotune/common/data/system/info/device/DeviceDetails.java b/src/main/java/com/autotune/common/data/system/info/device/DeviceDetails.java new file mode 100644 index 000000000..584891b60 --- /dev/null +++ b/src/main/java/com/autotune/common/data/system/info/device/DeviceDetails.java @@ -0,0 +1,7 @@ +package com.autotune.common.data.system.info.device; + +import com.autotune.analyzer.utils.AnalyzerConstants; + +public interface DeviceDetails { + public AnalyzerConstants.DeviceType getType(); +} diff --git a/src/main/java/com/autotune/common/data/system/info/device/DeviceHandler.java b/src/main/java/com/autotune/common/data/system/info/device/DeviceHandler.java new file mode 100644 index 000000000..447716440 --- /dev/null +++ b/src/main/java/com/autotune/common/data/system/info/device/DeviceHandler.java @@ -0,0 +1,15 @@ +package com.autotune.common.data.system.info.device; + +import com.autotune.analyzer.utils.AnalyzerConstants; + +import java.util.ArrayList; + +public interface DeviceHandler { + public void addDevice(AnalyzerConstants.DeviceType deviceType, DeviceDetails deviceInfo); + public void removeDevice(AnalyzerConstants.DeviceType deviceType, DeviceDetails deviceInfo); + public void updateDevice(AnalyzerConstants.DeviceType deviceType, DeviceDetails deviceInfo); + public DeviceDetails getDeviceByParameter(AnalyzerConstants.DeviceType deviceType, + String matchIdentifier, + AnalyzerConstants.DeviceParameters deviceParameters); + public ArrayList getDevices(AnalyzerConstants.DeviceType deviceType); +} diff --git a/src/main/java/com/autotune/common/data/system/info/device/accelerator/AcceleratorDeviceData.java b/src/main/java/com/autotune/common/data/system/info/device/accelerator/AcceleratorDeviceData.java new file mode 100644 index 000000000..a3a09fead --- /dev/null +++ b/src/main/java/com/autotune/common/data/system/info/device/accelerator/AcceleratorDeviceData.java @@ -0,0 +1,59 @@ +package com.autotune.common.data.system.info.device.accelerator; + +import com.autotune.analyzer.utils.AnalyzerConstants; + +public class AcceleratorDeviceData implements AcceleratorDeviceDetails { + private final String manufacturer; + private final String modelName; + private final String hostName; + private final String UUID; + private final String deviceName; + private boolean isMIG; + + public AcceleratorDeviceData (String modelName, String hostName, String UUID, String deviceName, boolean isMIG) { + this.manufacturer = "NVIDIA"; + this.modelName = modelName; + this.hostName = hostName; + this.UUID = UUID; + this.deviceName = deviceName; + this.isMIG = isMIG; + } + + @Override + public String getManufacturer() { + return this.manufacturer; + } + + @Override + public String getModelName() { + return modelName; + } + + @Override + public String getHostName() { + return hostName; + } + + @Override + public String getUUID() { + return UUID; + } + + @Override + public String getDeviceName() { + return deviceName; + } + + public boolean isMIG() { + return isMIG; + } + + public void setMIG(boolean isMIG) { + this.isMIG = isMIG; + } + + @Override + public AnalyzerConstants.DeviceType getType() { + return AnalyzerConstants.DeviceType.ACCELERATOR; + } +} diff --git 
a/src/main/java/com/autotune/common/data/system/info/device/accelerator/AcceleratorDeviceDetails.java b/src/main/java/com/autotune/common/data/system/info/device/accelerator/AcceleratorDeviceDetails.java new file mode 100644 index 000000000..31b90ff66 --- /dev/null +++ b/src/main/java/com/autotune/common/data/system/info/device/accelerator/AcceleratorDeviceDetails.java @@ -0,0 +1,11 @@ +package com.autotune.common.data.system.info.device.accelerator; + +import com.autotune.common.data.system.info.device.DeviceDetails; + +public interface AcceleratorDeviceDetails extends DeviceDetails { + public String getManufacturer(); + public String getModelName(); + public String getHostName(); + public String getUUID(); + public String getDeviceName(); +} diff --git a/src/main/java/com/autotune/common/data/system/info/device/accelerator/metadata/AcceleratorMetaDataService.java b/src/main/java/com/autotune/common/data/system/info/device/accelerator/metadata/AcceleratorMetaDataService.java new file mode 100644 index 000000000..6a5fd8187 --- /dev/null +++ b/src/main/java/com/autotune/common/data/system/info/device/accelerator/metadata/AcceleratorMetaDataService.java @@ -0,0 +1,103 @@ +package com.autotune.common.data.system.info.device.accelerator.metadata; + + + +import com.autotune.analyzer.utils.AnalyzerConstants; + +import java.util.ArrayList; +import java.util.HashMap; +import java.util.List; +import java.util.Map; + +/** + * A service which is created to provide the respective Accelerator Profile + * based on SM and Memory requirements + * + * This service initially loads the profiles of supported Accelerators + * Currently it supports: + * NVIDIA A100 40GB + * NVIDIA A100 80GB + * NVIDIA H100 80GB + */ +public class AcceleratorMetaDataService { + private static Map> acceleratorProfilesMap; + private static AcceleratorMetaDataService acceleratorMetaDataService = null; + + /** + * + */ + private AcceleratorMetaDataService(){ + acceleratorProfilesMap = new HashMap<>(); + initializeAcceleratorProfiles(); + } + + private static void initializeAcceleratorProfiles() { + List commonProfiles = new ArrayList<>(); + // IMPORTANT: Add it in the ascending order according to GPU Core and Memory Units as we will break the loop upon getting the right one + commonProfiles.add(new AcceleratorProfile(AnalyzerConstants.AcceleratorConstants.AcceleratorProfiles.PROFILE_1G_10GB, + 1.0 / 8, 1.0 / 7, 7)); + commonProfiles.add(new AcceleratorProfile(AnalyzerConstants.AcceleratorConstants.AcceleratorProfiles.PROFILE_1G_20GB, + 1.0 / 4, 1.0 / 7, 4)); + commonProfiles.add(new AcceleratorProfile(AnalyzerConstants.AcceleratorConstants.AcceleratorProfiles.PROFILE_2G_20GB, + 2.0 / 8, 2.0 / 7, 3)); + commonProfiles.add(new AcceleratorProfile(AnalyzerConstants.AcceleratorConstants.AcceleratorProfiles.PROFILE_3G_40GB, + 4.0 / 8, 3.0 / 7, 2)); + commonProfiles.add(new AcceleratorProfile(AnalyzerConstants.AcceleratorConstants.AcceleratorProfiles.PROFILE_4G_40GB, + 4.0 / 8, 4.0 / 7, 1)); + commonProfiles.add(new AcceleratorProfile(AnalyzerConstants.AcceleratorConstants.AcceleratorProfiles.PROFILE_7G_80GB, + 1.0, 1.0, 1)); + + List a100_40_gb_profiles = new ArrayList<>(); + // IMPORTANT: Add it in the ascending order according to GPU Core and Memory Units as we will break the loop upon getting the right one + a100_40_gb_profiles.add(new AcceleratorProfile(AnalyzerConstants.AcceleratorConstants.AcceleratorProfiles.PROFILE_1G_5GB, + 1.0 / 8, 1.0 / 7, 7)); + a100_40_gb_profiles.add(new 
AcceleratorProfile(AnalyzerConstants.AcceleratorConstants.AcceleratorProfiles.PROFILE_1G_10GB, + 1.0 / 4, 1.0 / 7, 4)); + a100_40_gb_profiles.add(new AcceleratorProfile(AnalyzerConstants.AcceleratorConstants.AcceleratorProfiles.PROFILE_2G_10GB, + 2.0 / 8, 2.0 / 7, 3)); + a100_40_gb_profiles.add(new AcceleratorProfile(AnalyzerConstants.AcceleratorConstants.AcceleratorProfiles.PROFILE_3G_20GB, + 4.0 / 8, 3.0 / 7, 2)); + a100_40_gb_profiles.add(new AcceleratorProfile(AnalyzerConstants.AcceleratorConstants.AcceleratorProfiles.PROFILE_4G_20GB, + 4.0 / 8, 4.0 / 7, 1)); + a100_40_gb_profiles.add(new AcceleratorProfile(AnalyzerConstants.AcceleratorConstants.AcceleratorProfiles.PROFILE_7G_40GB, + 1.0, 1.0, 1)); + + acceleratorProfilesMap.put(AnalyzerConstants.AcceleratorConstants.SupportedAccelerators.A100_80_GB, new ArrayList<>(commonProfiles)); + acceleratorProfilesMap.put(AnalyzerConstants.AcceleratorConstants.SupportedAccelerators.H100_80_GB, new ArrayList<>(commonProfiles)); + acceleratorProfilesMap.put(AnalyzerConstants.AcceleratorConstants.SupportedAccelerators.A100_40_GB, new ArrayList<>(a100_40_gb_profiles)); + } + + public static AcceleratorMetaDataService getInstance() { + if(null == acceleratorMetaDataService) { + synchronized (AcceleratorMetaDataService.class) { + if (null == acceleratorMetaDataService) { + acceleratorMetaDataService = new AcceleratorMetaDataService(); + } + } + } + return acceleratorMetaDataService; + } + + public AcceleratorProfile getAcceleratorProfile(String modelName, Double requiredSmFraction, Double requiredMemoryFraction) { + if (null == modelName || null == requiredSmFraction || null == requiredMemoryFraction) { + return null; + } + modelName = modelName.strip(); + if (!modelName.equalsIgnoreCase(AnalyzerConstants.AcceleratorConstants.SupportedAccelerators.A100_80_GB) + && !modelName.equalsIgnoreCase(AnalyzerConstants.AcceleratorConstants.SupportedAccelerators.H100_80_GB) + && !modelName.equalsIgnoreCase(AnalyzerConstants.AcceleratorConstants.SupportedAccelerators.A100_40_GB)) { + return null; + } + if (requiredMemoryFraction < 0.0 || requiredSmFraction < 0.0) { + return null; + } + List gpuProfiles = acceleratorProfilesMap.get(modelName); + for (AcceleratorProfile profile : gpuProfiles) { + if (profile.getMemoryFraction() >= requiredMemoryFraction && profile.getSmFraction() >= requiredSmFraction) { + // Returning the profile as the list is in ascending order + return profile; + } + } + return null; + } +} diff --git a/src/main/java/com/autotune/common/data/system/info/device/accelerator/metadata/AcceleratorProfile.java b/src/main/java/com/autotune/common/data/system/info/device/accelerator/metadata/AcceleratorProfile.java new file mode 100644 index 000000000..c0db82b50 --- /dev/null +++ b/src/main/java/com/autotune/common/data/system/info/device/accelerator/metadata/AcceleratorProfile.java @@ -0,0 +1,51 @@ +package com.autotune.common.data.system.info.device.accelerator.metadata; + +/** + * Class which is used to store the details of an accelerator profile + */ +public class AcceleratorProfile { + private final String profileName; + private final double memoryFraction; + private final double smFraction; + private final int instancesAvailable; + + /** + * Constructor to create the Accelerator Profile + * @param profileName - Name of the profile + * @param memoryFraction - Fraction of memory out of the whole accelerator memory + * @param smFraction - Fraction of Cores or Streaming Processors out if the whole accelerator cores + * @param instancesAvailable - 
Number of instances of a profile available on an Accelerator + */ + public AcceleratorProfile(String profileName, double memoryFraction, double smFraction, int instancesAvailable) { + this.profileName = profileName; + this.memoryFraction = memoryFraction; + this.smFraction = smFraction; + this.instancesAvailable = instancesAvailable; + } + + public String getProfileName() { + return this.profileName; + } + + public double getMemoryFraction() { + return memoryFraction; + } + + public double getSmFraction() { + return smFraction; + } + + public int getInstancesAvailable() { + return instancesAvailable; + } + + @Override + public String toString() { + return "AcceleratorProfile{" + + "profileName='" + profileName + '\'' + + ", memoryFraction=" + memoryFraction + + ", smFraction=" + smFraction + + ", instancesAvailable=" + instancesAvailable + + '}'; + } +} diff --git a/src/main/java/com/autotune/common/datasource/DataSourceManager.java b/src/main/java/com/autotune/common/datasource/DataSourceManager.java index 441a70516..94bfc4ce5 100644 --- a/src/main/java/com/autotune/common/datasource/DataSourceManager.java +++ b/src/main/java/com/autotune/common/datasource/DataSourceManager.java @@ -1,3 +1,18 @@ +/******************************************************************************* + * Copyright (c) 2020, 2021 Red Hat, IBM Corporation and others. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + *******************************************************************************/ package com.autotune.common.datasource; import com.autotune.analyzer.utils.AnalyzerErrorConstants; @@ -32,13 +47,19 @@ public DataSourceManager() { /** * Imports Metadata for a specific data source using associated DataSourceInfo. 
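+ * <p>
+ * When a non-null {@code uniqueKey} is supplied, it is spliced downstream (in
+ * {@code DataSourceMetadataOperator}) into the PromQL metadata queries in place of the
+ * {@code ADDITIONAL_LABEL} placeholder. For example, a key of {@code container="xyz"}
+ * turns {@code kube_namespace_status_phase{namespace!="" ADDITIONAL_LABEL}} into
+ * {@code kube_namespace_status_phase{namespace!="" ,container="xyz"}}.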
+ * @param dataSourceInfo + * @param uniqueKey this is used as labels in query example container="xyz" namespace="abc" + * @param startTime Get metadata from starttime to endtime + * @param endTime Get metadata from starttime to endtime + * @param steps the interval between data points in a range query + * @return */ - public DataSourceMetadataInfo importMetadataFromDataSource(DataSourceInfo dataSourceInfo) { + public DataSourceMetadataInfo importMetadataFromDataSource(DataSourceInfo dataSourceInfo,String uniqueKey,long startTime,long endTime,int steps) { try { if (null == dataSourceInfo) { throw new DataSourceDoesNotExist(KruizeConstants.DataSourceConstants.DataSourceErrorMsgs.MISSING_DATASOURCE_INFO); } - DataSourceMetadataInfo dataSourceMetadataInfo = dataSourceMetadataOperator.createDataSourceMetadata(dataSourceInfo); + DataSourceMetadataInfo dataSourceMetadataInfo = dataSourceMetadataOperator.createDataSourceMetadata(dataSourceInfo,uniqueKey, startTime, endTime, steps); if (null == dataSourceMetadataInfo) { LOGGER.error(KruizeConstants.DataSourceConstants.DataSourceMetadataErrorMsgs.DATASOURCE_METADATA_INFO_NOT_AVAILABLE, "for datasource {}" + dataSourceInfo.getName()); return null; @@ -91,7 +112,7 @@ public void updateMetadataFromDataSource(DataSourceInfo dataSource, DataSourceMe if (null == dataSourceMetadataInfo) { throw new DataSourceDoesNotExist(KruizeConstants.DataSourceConstants.DataSourceMetadataErrorMsgs.DATASOURCE_METADATA_INFO_NOT_AVAILABLE); } - dataSourceMetadataOperator.updateDataSourceMetadata(dataSource); + dataSourceMetadataOperator.updateDataSourceMetadata(dataSource,"",0,0,0); } catch (Exception e) { LOGGER.error(e.getMessage()); } diff --git a/src/main/java/com/autotune/common/datasource/DataSourceMetadataOperator.java b/src/main/java/com/autotune/common/datasource/DataSourceMetadataOperator.java index d1079564b..bd51e797b 100644 --- a/src/main/java/com/autotune/common/datasource/DataSourceMetadataOperator.java +++ b/src/main/java/com/autotune/common/datasource/DataSourceMetadataOperator.java @@ -1,3 +1,18 @@ +/******************************************************************************* + * Copyright (c) 2020, 2021 Red Hat, IBM Corporation and others. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + *******************************************************************************/ package com.autotune.common.datasource; import com.autotune.common.data.dataSourceQueries.PromQLDataSourceQueries; @@ -31,10 +46,14 @@ public class DataSourceMetadataOperator { * Currently supported DataSourceProvider - Prometheus * * @param dataSourceInfo The DataSourceInfo object containing information about the data source. 
+ * @param uniqueKey this is used as labels in query example container="xyz" namespace="abc" + * @param startTime Get metadata from starttime to endtime + * @param endTime Get metadata from starttime to endtime + * @param steps the interval between data points in a range query * TODO - support multiple data sources */ - public DataSourceMetadataInfo createDataSourceMetadata(DataSourceInfo dataSourceInfo) { - return processQueriesAndPopulateDataSourceMetadataInfo(dataSourceInfo); + public DataSourceMetadataInfo createDataSourceMetadata(DataSourceInfo dataSourceInfo, String uniqueKey, long startTime, long endTime, int steps) { + return processQueriesAndPopulateDataSourceMetadataInfo(dataSourceInfo, uniqueKey, startTime, endTime, steps); } /** @@ -75,8 +94,8 @@ public DataSourceMetadataInfo getDataSourceMetadataInfo(DataSourceInfo dataSourc * TODO - Currently Create and Update functions have identical functionalities, based on UI workflow and requirements * need to further enhance updateDataSourceMetadata() to support namespace, workload level granular updates */ - public DataSourceMetadataInfo updateDataSourceMetadata(DataSourceInfo dataSourceInfo) { - return processQueriesAndPopulateDataSourceMetadataInfo(dataSourceInfo); + public DataSourceMetadataInfo updateDataSourceMetadata(DataSourceInfo dataSourceInfo, String uniqueKey, long startTime, long endTime, int steps) { + return processQueriesAndPopulateDataSourceMetadataInfo(dataSourceInfo, uniqueKey, startTime, endTime, steps); } /** @@ -108,9 +127,14 @@ public void deleteDataSourceMetadata(DataSourceInfo dataSourceInfo) { * DataSourceMetadataInfo object * * @param dataSourceInfo The DataSourceInfo object containing information about the data source + * @param uniqueKey this is used as labels in query example container="xyz" namespace="abc" + * @param startTime Get metadata from starttime to endtime + * @param endTime Get metadata from starttime to endtime + * @param steps the interval between data points in a range query * @return DataSourceMetadataInfo object with populated metadata fields + * todo rename processQueriesAndFetchClusterMetadataInfo */ - public DataSourceMetadataInfo processQueriesAndPopulateDataSourceMetadataInfo(DataSourceInfo dataSourceInfo) { + public DataSourceMetadataInfo processQueriesAndPopulateDataSourceMetadataInfo(DataSourceInfo dataSourceInfo, String uniqueKey, long startTime, long endTime, int steps) { DataSourceMetadataHelper dataSourceDetailsHelper = new DataSourceMetadataHelper(); /** * Get DataSourceOperatorImpl instance on runtime based on dataSource provider @@ -129,8 +153,25 @@ public DataSourceMetadataInfo processQueriesAndPopulateDataSourceMetadataInfo(Da */ try { String dataSourceName = dataSourceInfo.getName(); - JsonArray namespacesDataResultArray = op.getResultArrayForQuery(dataSourceInfo, PromQLDataSourceQueries.NAMESPACE_QUERY); - if (false == op.validateResultArray(namespacesDataResultArray)){ + String namespaceQuery = PromQLDataSourceQueries.NAMESPACE_QUERY; + String workloadQuery = PromQLDataSourceQueries.WORKLOAD_QUERY; + String containerQuery = PromQLDataSourceQueries.CONTAINER_QUERY; + if (null != uniqueKey) { + LOGGER.info("uniquekey: {}", uniqueKey); + namespaceQuery = namespaceQuery.replace("ADDITIONAL_LABEL", "," + uniqueKey); + workloadQuery = workloadQuery.replace("ADDITIONAL_LABEL", "," + uniqueKey); + containerQuery = containerQuery.replace("ADDITIONAL_LABEL", "," + uniqueKey); + } else { + namespaceQuery = namespaceQuery.replace("ADDITIONAL_LABEL", ""); + workloadQuery = 
workloadQuery.replace("ADDITIONAL_LABEL", ""); + containerQuery = containerQuery.replace("ADDITIONAL_LABEL", ""); + } + LOGGER.info("namespaceQuery: {}", namespaceQuery); + LOGGER.info("workloadQuery: {}", workloadQuery); + LOGGER.info("containerQuery: {}", containerQuery); + + JsonArray namespacesDataResultArray = op.getResultArrayForQuery(dataSourceInfo, namespaceQuery); + if (false == op.validateResultArray(namespacesDataResultArray)) { dataSourceMetadataInfo = dataSourceDetailsHelper.createDataSourceMetadataInfoObject(dataSourceName, null); throw new Exception(KruizeConstants.DataSourceConstants.DataSourceMetadataErrorMsgs.NAMESPACE_QUERY_VALIDATION_FAILED); } @@ -153,7 +194,7 @@ public DataSourceMetadataInfo processQueriesAndPopulateDataSourceMetadataInfo(Da */ HashMap> datasourceWorkloads = new HashMap<>(); JsonArray workloadDataResultArray = op.getResultArrayForQuery(dataSourceInfo, - PromQLDataSourceQueries.WORKLOAD_QUERY); + workloadQuery); if (op.validateResultArray(workloadDataResultArray)) { datasourceWorkloads = dataSourceDetailsHelper.getWorkloadInfo(workloadDataResultArray); @@ -172,7 +213,7 @@ public DataSourceMetadataInfo processQueriesAndPopulateDataSourceMetadataInfo(Da */ HashMap> datasourceContainers = new HashMap<>(); JsonArray containerDataResultArray = op.getResultArrayForQuery(dataSourceInfo, - PromQLDataSourceQueries.CONTAINER_QUERY); + containerQuery); if (op.validateResultArray(containerDataResultArray)) { datasourceContainers = dataSourceDetailsHelper.getContainerInfo(containerDataResultArray); diff --git a/src/main/java/com/autotune/common/utils/CommonUtils.java b/src/main/java/com/autotune/common/utils/CommonUtils.java index 384bc5dc3..ddd965d6e 100644 --- a/src/main/java/com/autotune/common/utils/CommonUtils.java +++ b/src/main/java/com/autotune/common/utils/CommonUtils.java @@ -19,12 +19,13 @@ import com.autotune.common.datasource.DataSourceCollection; import com.autotune.common.datasource.DataSourceInfo; import com.autotune.common.datasource.DataSourceManager; + import com.autotune.utils.KruizeConstants; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; import java.sql.Timestamp; -import java.util.Calendar; -import java.util.Collections; -import java.util.List; +import java.util.*; import java.util.concurrent.TimeUnit; import java.util.regex.Matcher; import java.util.regex.Pattern; @@ -34,6 +35,8 @@ */ public class CommonUtils { + private static final Logger LOGGER = LoggerFactory.getLogger(CommonUtils.class); + /** * AutotuneDatasourceTypes is an ENUM which holds different types of * datasources supported by Autotune diff --git a/src/main/java/com/autotune/database/dao/ExperimentDAOImpl.java b/src/main/java/com/autotune/database/dao/ExperimentDAOImpl.java index 21930c327..7b72baf77 100644 --- a/src/main/java/com/autotune/database/dao/ExperimentDAOImpl.java +++ b/src/main/java/com/autotune/database/dao/ExperimentDAOImpl.java @@ -1,3 +1,18 @@ +/******************************************************************************* + * Copyright (c) 2020, 2021 Red Hat, IBM Corporation and others. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+ * See the License for the specific language governing permissions and + * limitations under the License. + *******************************************************************************/ package com.autotune.database.dao; import com.autotune.analyzer.kruizeObject.KruizeObject; @@ -27,7 +42,10 @@ import java.time.LocalDateTime; import java.time.YearMonth; import java.time.temporal.ChronoUnit; -import java.util.*; +import java.util.ArrayList; +import java.util.Calendar; +import java.util.Date; +import java.util.List; import java.util.stream.IntStream; import static com.autotune.database.helper.DBConstants.DB_MESSAGES.DUPLICATE_KEY; @@ -150,9 +168,9 @@ public void addPartitions(String tableName, String month, String year, int dayOf year, month, String.format("%02d", i), year, month, String.format("%02d", i)); session.createNativeQuery(daterange).executeUpdate(); }); - } else if (partitionType.equalsIgnoreCase(DBConstants.PARTITION_TYPES.BY_DAY)) { - String daterange = String.format(DB_PARTITION_DATERANGE, tableName, year, month, String.format("%02d", 1), tableName, - year, month, String.format("%02d", 1), year, month, String.format("%02d", 1)); + } else if (partitionType.equalsIgnoreCase(DBConstants.PARTITION_TYPES.BY_DAY)) { //ROS not calling this condition + String daterange = String.format(DB_PARTITION_DATERANGE, tableName, year, month, dayOfTheMonth, tableName, + year, month, dayOfTheMonth, year, month, dayOfTheMonth); session.createNativeQuery(daterange).executeUpdate(); } else { LOGGER.error(DBConstants.DB_MESSAGES.INVALID_PARTITION_TYPE); @@ -239,7 +257,9 @@ public List addToDBAndFetchFailedResults(List loadAllPerformanceProfiles() throws E /** * Fetches all the Metric Profile records from KruizeMetricProfileEntry database table + * * @return List of all KruizeMetricProfileEntry database objects * @throws Exception */ @@ -779,7 +804,6 @@ public List loadExperimentFromDBByInputJSON(StringBuilder } - @Override public List loadResultsByExperimentName(String experimentName, String cluster_name, Timestamp calculated_start_time, Timestamp interval_end_time) throws Exception { // TODO: load only experimentStatus=inProgress , playback may not require completed experiments @@ -898,6 +922,7 @@ public List loadPerformanceProfileByName(String p /** * Fetches Metric Profile by name from KruizeMetricProfileEntry database table + * * @param metricProfileName Metric profile name * @return List of KruizeMetricProfileEntry objects * @throws Exception @@ -985,7 +1010,7 @@ public List loadMetadataByName(String dataSourceName) thr * Retrieves a list of KruizeDSMetadataEntry objects based on the specified datasource name and cluster name. * * @param dataSourceName The name of the datasource. - * @param clusterName The name of the cluster. + * @param clusterName The name of the cluster. * @return A list of KruizeDSMetadataEntry objects associated with the provided datasource and cluster name. * @throws Exception If there is an error while loading metadata from the database. */ @@ -1010,8 +1035,8 @@ public List loadMetadataByClusterName(String dataSourceNa * datasource name, cluster name and namespace. * * @param dataSourceName The name of the datasource. - * @param clusterName The name of the cluster. - * @param namespace namespace + * @param clusterName The name of the cluster. + * @param namespace namespace * @return A list of KruizeDSMetadataEntry objects associated with the provided datasource, cluster name and namespaces. 
* @throws Exception If there is an error while loading metadata from the database. */ @@ -1021,7 +1046,7 @@ public List loadMetadataByNamespace(String dataSourceName Query kruizeMetadataQuery = session.createQuery(SELECT_FROM_METADATA_BY_DATASOURCE_NAME_CLUSTER_NAME_AND_NAMESPACE, KruizeDSMetadataEntry.class) .setParameter("datasource_name", dataSourceName) .setParameter("cluster_name", clusterName) - .setParameter("namespace",namespace); + .setParameter("namespace", namespace); kruizeMetadataList = kruizeMetadataQuery.list(); } catch (Exception e) { @@ -1066,14 +1091,16 @@ public List loadAllDataSources() throws Exception { private void getExperimentTypeInKruizeExperimentEntry(List entries) throws Exception { try (Session session = KruizeHibernateUtil.getSessionFactory().openSession()) { - for (KruizeExperimentEntry entry: entries) { + for (KruizeExperimentEntry entry : entries) { if (isTargetCluserLocal(entry.getTarget_cluster())) { - String sql = DBConstants.SQLQUERY.SELECT_EXPERIMENT_EXP_TYPE; - Query query = session.createNativeQuery(sql); - query.setParameter("experiment_id", entry.getExperiment_id()); - List experimentType = query.getResultList(); - if (null != experimentType && !experimentType.isEmpty()) { - entry.setExperimentType(experimentType.get(0)); + if (null == entry.getExperimentType() || entry.getExperimentType().isEmpty()) { + String sql = DBConstants.SQLQUERY.SELECT_EXPERIMENT_EXP_TYPE; + Query query = session.createNativeQuery(sql); + query.setParameter("experiment_id", entry.getExperiment_id()); + List experimentType = query.getResultList(); + if (null != experimentType && !experimentType.isEmpty()) { + entry.setExperimentType(experimentType.get(0)); + } } } } @@ -1101,7 +1128,7 @@ private void updateExperimentTypeInKruizeExperimentEntry(KruizeExperimentEntry k } private void getExperimentTypeInKruizeRecommendationsEntry(List entries) throws Exception { - for (KruizeRecommendationEntry recomEntry: entries) { + for (KruizeRecommendationEntry recomEntry : entries) { getExperimentTypeInSingleKruizeRecommendationsEntry(recomEntry); } } diff --git a/src/main/java/com/autotune/database/helper/DBHelpers.java b/src/main/java/com/autotune/database/helper/DBHelpers.java index fd09f54ec..8b3d018fd 100644 --- a/src/main/java/com/autotune/database/helper/DBHelpers.java +++ b/src/main/java/com/autotune/database/helper/DBHelpers.java @@ -16,6 +16,8 @@ package com.autotune.database.helper; +import com.autotune.analyzer.adapters.DeviceDetailsAdapter; +import com.autotune.analyzer.adapters.RecommendationItemAdapter; import com.autotune.analyzer.exceptions.InvalidConversionOfRecommendationEntryException; import com.autotune.analyzer.kruizeObject.KruizeObject; import com.autotune.analyzer.kruizeObject.SloInfo; @@ -32,6 +34,7 @@ import com.autotune.common.data.result.ContainerData; import com.autotune.common.data.result.ExperimentResultData; import com.autotune.common.data.result.NamespaceData; +import com.autotune.common.data.system.info.device.DeviceDetails; import com.autotune.common.datasource.DataSourceCollection; import com.autotune.common.datasource.DataSourceInfo; import com.autotune.common.datasource.DataSourceMetadataOperator; @@ -334,6 +337,8 @@ public static KruizeResultsEntry convertExperimentResultToExperimentResultsTable .enableComplexMapKeySerialization() .setDateFormat(KruizeConstants.DateFormats.STANDARD_JSON_DATE_FORMAT) .registerTypeAdapter(Date.class, new GsonUTCDateAdapter()) + .registerTypeAdapter(AnalyzerConstants.RecommendationItem.class, new 
diff --git a/src/main/java/com/autotune/database/service/ExperimentDBService.java b/src/main/java/com/autotune/database/service/ExperimentDBService.java
index b04614068..270bff3c1 100644
--- a/src/main/java/com/autotune/database/service/ExperimentDBService.java
+++ b/src/main/java/com/autotune/database/service/ExperimentDBService.java
@@ -27,7 +27,6 @@
 import com.autotune.common.data.dataSourceMetadata.DataSourceMetadataInfo;
 import com.autotune.common.data.result.ExperimentResultData;
 import com.autotune.common.datasource.DataSourceInfo;
-import com.autotune.common.k8sObjects.K8sObject;
 import com.autotune.database.dao.ExperimentDAO;
 import com.autotune.database.dao.ExperimentDAOImpl;
 import com.autotune.database.helper.DBConstants;
@@ -39,10 +38,7 @@
 import org.slf4j.LoggerFactory;

 import java.sql.Timestamp;
-import java.time.LocalDateTime;
-import java.util.ArrayList;
-import java.util.List;
-import java.util.Map;
+import java.util.*;

 public class ExperimentDBService {
     private static final long serialVersionUID = 1L;
@@ -251,11 +247,15 @@ public ValidationOutputData addRecommendationToDB(Map expe
                 convertKruizeObjectTORecommendation(kruizeObject, interval_end_time);
         if (null != kr) {
             if (KruizeDeploymentInfo.local == true) { //todo this code will be removed
-                LocalDateTime localDateTime = kr.getInterval_end_time().toLocalDateTime();
+                // Create a Calendar object and set the time with the timestamp
+                Calendar localDateTime = Calendar.getInstance(TimeZone.getTimeZone("UTC"));
+                localDateTime.setTime(kr.getInterval_end_time());
                 ExperimentDAO dao = new ExperimentDAOImpl();
-                int dayOfTheMonth = localDateTime.getDayOfMonth();
+                int dayOfTheMonth = localDateTime.get(Calendar.DAY_OF_MONTH);
                 try {
-                    dao.addPartitions(DBConstants.TABLE_NAMES.KRUIZE_RECOMMENDATIONS, String.format("%02d", localDateTime.getMonthValue()), String.valueOf(localDateTime.getYear()), dayOfTheMonth, DBConstants.PARTITION_TYPES.BY_MONTH);
+                    synchronized (ExperimentDBService.class) { // shared monitor; synchronizing on a new Object() would provide no mutual exclusion
+                        dao.addPartitions(DBConstants.TABLE_NAMES.KRUIZE_RECOMMENDATIONS, String.format("%02d", localDateTime.get(Calendar.MONTH) + 1), String.valueOf(localDateTime.get(Calendar.YEAR)), dayOfTheMonth, DBConstants.PARTITION_TYPES.BY_DAY);
+                    }
                 } catch (Exception e) {
                     LOGGER.warn(e.getMessage());
                 }
@@ -285,6 +285,7 @@ public ValidationOutputData addPerformanceProfileToDB(PerformanceProfile perform

     /**
      * Adds Metric Profile to kruizeMetricProfileEntry
+     *
      * @param metricProfile Metric profile object to be added
      * @return ValidationOutputData object
      */
@@ -391,7 +392,8 @@ public void loadPerformanceProfileFromDBByName(Map p

     /**
      * Fetches Metric Profile by name from kruizeMetricProfileEntry
-     * @param metricProfileMap Map to store metric profile loaded from the database
+     *
+     * @param metricProfileMap  Map to store metric profile loaded from the database
      * @param metricProfileName Metric profile name to be fetched
      * @return ValidationOutputData object
      */
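The `LocalDateTime` to `Calendar` switch above is easy to get wrong because `Calendar.MONTH` is zero-based (`JANUARY == 0`); the `+ 1` is what keeps the partition key aligned with the calendar month. A minimal sketch of the key derivation, with a hypothetical helper name, assuming the timestamp should always be interpreted in UTC:

```java
import java.sql.Timestamp;
import java.util.Calendar;
import java.util.TimeZone;

public class PartitionKeyDemo {
    // Derives the year/month/day parts used to name a daily partition.
    // Calendar.MONTH is zero-based, hence the "+ 1" below.
    public static String[] partitionKey(Timestamp ts) {
        Calendar cal = Calendar.getInstance(TimeZone.getTimeZone("UTC"));
        cal.setTime(ts);
        return new String[]{
                String.valueOf(cal.get(Calendar.YEAR)),
                String.format("%02d", cal.get(Calendar.MONTH) + 1),
                String.format("%02d", cal.get(Calendar.DAY_OF_MONTH))
        };
    }

    public static void main(String[] args) {
        String[] parts = partitionKey(Timestamp.valueOf("2024-10-10 06:07:09"));
        // Prints the UTC-derived key, e.g. 2024-10-10 (the day can differ from
        // local time near midnight, which is exactly why UTC is pinned here).
        System.out.println(String.join("-", parts));
    }
}
```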
diff --git a/src/main/java/com/autotune/operator/KruizeDeploymentInfo.java b/src/main/java/com/autotune/operator/KruizeDeploymentInfo.java
index 4be00ff62..214fab595 100644
--- a/src/main/java/com/autotune/operator/KruizeDeploymentInfo.java
+++ b/src/main/java/com/autotune/operator/KruizeDeploymentInfo.java
@@ -79,7 +79,10 @@ public class KruizeDeploymentInfo {
     public static Integer bulk_update_results_limit = 100;
     public static Boolean local = false;
     public static Boolean log_http_req_resp = false;
-
+    public static String recommendations_url;
+    public static int BULK_API_LIMIT = 1000;
+    public static int BULK_API_MAX_BATCH_SIZE = 100;
+    public static Integer bulk_thread_pool_size = 3;
     public static int generate_recommendations_date_range_limit_in_days = 15;
     public static Integer delete_partition_threshold_in_days = DELETE_PARTITION_THRESHOLD_IN_DAYS;
     private static Hashtable tunableLayerPair;
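How these defaults get overridden is outside this hunk. The sketch below shows the kind of environment-variable wiring implied by the `KRUIZE_CONFIG_ENV_NAME` keys added further down (`bulkapilimit`, `bulkThreadPoolSize`); the lookup mechanism here is an assumption for illustration, not Kruize's actual config loader.

```java
public class BulkConfigDemo {
    public static void main(String[] args) {
        // Env var names taken from KRUIZE_CONFIG_ENV_NAME below; defaults mirror
        // the KruizeDeploymentInfo fields above. The System.getenv fallback
        // pattern is a hypothetical stand-in for the real config plumbing.
        String limit = System.getenv("bulkapilimit");
        int bulkApiLimit = (limit != null) ? Integer.parseInt(limit) : 1000;

        String pool = System.getenv("bulkThreadPoolSize");
        int bulkThreadPoolSize = (pool != null) ? Integer.parseInt(pool) : 3;

        System.out.printf("bulk limit=%d, thread pool=%d%n", bulkApiLimit, bulkThreadPoolSize);
    }
}
```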
diff --git a/src/main/java/com/autotune/utils/KruizeConstants.java b/src/main/java/com/autotune/utils/KruizeConstants.java
index 15779cdae..ab3732843 100644
--- a/src/main/java/com/autotune/utils/KruizeConstants.java
+++ b/src/main/java/com/autotune/utils/KruizeConstants.java
@@ -17,6 +17,8 @@
 package com.autotune.utils;

+import com.autotune.analyzer.kruizeObject.CreateExperimentConfigBean;
+
 import java.text.SimpleDateFormat;
 import java.util.Locale;
 import java.util.TimeZone;
@@ -168,6 +170,7 @@ public static final class JSONKeys {
         public static final String CONTAINER_METRICS = "container_metrics";
         public static final String METRICS = "metrics";
         public static final String CONFIG = "config";
+        public static final String METRIC = "metric";
         public static final String CURRENT = "current";
         public static final String NAME = "name";
         public static final String QUERY = "query";
@@ -262,6 +265,10 @@ public static final class JSONKeys {
         public static final String PLOTS_DATAPOINTS = "datapoints";
         public static final String PLOTS_DATA = "plots_data";
         public static final String CONFIDENCE_LEVEL = "confidence_level";
+        public static final String HOSTNAME = "Hostname";
+        public static final String UUID = "UUID";
+        public static final String DEVICE = "device";
+        public static final String MODEL_NAME = "modelName";

         private JSONKeys() {
         }
@@ -407,6 +414,7 @@ private DataSourceConstants() {
         public static class DataSourceDetailsInfoConstants {
             public static final String version = "v1.0";
             public static final String CLUSTER_NAME = "default";
+
             private DataSourceDetailsInfoConstants() {
             }
         }
@@ -448,6 +456,7 @@ public static class DataSourceErrorMsgs {
             public static final String ENDPOINT_NOT_FOUND = "Service endpoint not found.";
             public static final String MISSING_DATASOURCE_INFO = "Datasource is missing, add a valid Datasource";
             public static final String INVALID_DATASOURCE_INFO = "Datasource is either missing or is invalid";
+
             private DataSourceErrorMsgs() {
             }
         }
@@ -459,6 +468,7 @@ public static class DataSourceQueryJSONKeys {
             public static final String METRIC = "metric";
             public static final String VALUE = "value";
             public static final String VALUES = "values";
+
             private DataSourceQueryJSONKeys() {
             }

@@ -467,6 +477,7 @@ private DataSourceQueryJSONKeys() {
         public static class DataSourceQueryStatus {
             public static final String SUCCESS = "success";
             public static final String ERROR = "error";
+
             private DataSourceQueryStatus() {
             }
         }
@@ -477,6 +488,7 @@ public static class DataSourceQueryMetricKeys {
             public static final String WORKLOAD_TYPE = "workload_type";
             public static final String CONTAINER_NAME = "container";
             public static final String CONTAINER_IMAGE_NAME = "image";
+
             private DataSourceQueryMetricKeys() {
             }
         }
@@ -484,6 +496,7 @@ private DataSourceQueryMetricKeys() {
         public static class DataSourceMetadataInfoConstants {
             public static final String version = "v1.0";
             public static final String CLUSTER_NAME = "default";
+
             private DataSourceMetadataInfoConstants() {
             }
         }
@@ -520,6 +533,7 @@ public static class DataSourceMetadataErrorMsgs {
             public static final String DATASOURCE_METADATA_VALIDATION_FAILURE_MSG = "Validation of imported metadata failed, mandatory fields missing: %s";
             public static final String NAMESPACE_QUERY_VALIDATION_FAILED = "Validation failed for namespace data query.";
             public static final String DATASOURCE_OPERATOR_RETRIEVAL_FAILURE = "Failed to retrieve data source operator for provider: %s";
+
             private DataSourceMetadataErrorMsgs() {
             }
         }
@@ -537,6 +551,7 @@ public static class DataSourceMetadataInfoJSONKeys {
             public static final String CONTAINERS = "containers";
             public static final String CONTAINER_NAME = "container_name";
             public static final String CONTAINER_IMAGE_NAME = "container_image_name";
+
             private DataSourceMetadataInfoJSONKeys() {
             }
         }
@@ -661,6 +676,10 @@ public static final class KRUIZE_CONFIG_ENV_NAME {
         public static final String CLOUDWATCH_LOGS_LOG_LEVEL = "logging_cloudwatch_logLevel";
         public static final String LOCAL = "local";
         public static final String LOG_HTTP_REQ_RESP = "logAllHttpReqAndResp";
+        public static final String RECOMMENDATIONS_URL = "recommendationsURL";
+        public static final String BULK_API_LIMIT = "bulkapilimit";
+        public static final String BULK_API_CHUNK_SIZE = "bulkapichunksize";
+        public static final String BULK_THREAD_POOL_SIZE = "bulkThreadPoolSize";
     }

     public static final class RecommendationEngineConstants {
@@ -748,4 +767,30 @@ public static final class AuthenticationConstants {
         public static final String AUTHORIZATION = "Authorization";
     }
+
+    public static final class KRUIZE_BULK_API {
+        public static final String JOB_ID = "job_id";
+        public static final String ERROR = "error";
+        public static final String JOB_NOT_FOUND_MSG = "Job not found";
+        public static final String IN_PROGRESS = "IN_PROGRESS";
+        public static final String COMPLETED = "COMPLETED";
+        public static final String FAILED = "FAILED";
+        public static final String LIMIT_MESSAGE = "The number of experiments exceeds %s.";
+        public static final String NOTHING = "Nothing to do.";
+        // TODO : Bulk API Create Experiments defaults
+        public static final CreateExperimentConfigBean CREATE_EXPERIMENT_CONFIG_BEAN;

+        // Static block to initialize the Bean
+        static {
+            CREATE_EXPERIMENT_CONFIG_BEAN = new CreateExperimentConfigBean();
+            CREATE_EXPERIMENT_CONFIG_BEAN.setMode("monitor");
+            CREATE_EXPERIMENT_CONFIG_BEAN.setTarget("local");
+            CREATE_EXPERIMENT_CONFIG_BEAN.setVersion("v2.0");
+            CREATE_EXPERIMENT_CONFIG_BEAN.setDatasourceName("prometheus-1");
+            CREATE_EXPERIMENT_CONFIG_BEAN.setPerformanceProfile("resource-optimization-local-monitoring");
+            CREATE_EXPERIMENT_CONFIG_BEAN.setThreshold(0.1);
+            CREATE_EXPERIMENT_CONFIG_BEAN.setMeasurementDurationStr("15min");
+            CREATE_EXPERIMENT_CONFIG_BEAN.setMeasurementDuration(15);
+        }
+    }
 }
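The `KRUIZE_BULK_API` class centralizes the bulk job lifecycle strings. The sketch below shows how they might be combined with the `BULK_API_LIMIT` default from `KruizeDeploymentInfo`; the constants are duplicated locally so the example stands alone, and the control flow is illustrative, not the actual bulk service logic.

```java
public class BulkStatusDemo {
    // Local copies of the KRUIZE_BULK_API constants above, so this sketch compiles on its own.
    static final String JOB_ID = "job_id";
    static final String IN_PROGRESS = "IN_PROGRESS";
    static final String LIMIT_MESSAGE = "The number of experiments exceeds %s.";

    public static void main(String[] args) {
        int bulkApiLimit = 1000; // default from KruizeDeploymentInfo.BULK_API_LIMIT
        int discovered = 1200;   // hypothetical count of containers found via the datasource

        if (discovered > bulkApiLimit) {
            // Reject oversized jobs with the shared limit message.
            System.out.println(String.format(LIMIT_MESSAGE, bulkApiLimit));
        } else {
            // Otherwise hand back a trackable job id in IN_PROGRESS state.
            System.out.printf("{\"%s\": \"%s\", \"status\": \"%s\"}%n",
                    JOB_ID, java.util.UUID.randomUUID(), IN_PROGRESS);
        }
    }
}
```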
diff --git a/src/main/java/com/autotune/utils/ServerContext.java b/src/main/java/com/autotune/utils/ServerContext.java
index 2c95d3efe..eac7f6079 100644
--- a/src/main/java/com/autotune/utils/ServerContext.java
+++ b/src/main/java/com/autotune/utils/ServerContext.java
@@ -75,4 +75,7 @@ public class ServerContext {
     public static final String LIST_NAMESPACES = QUERY_CONTEXT + "listNamespaces";
     public static final String LIST_DEPLOYMENTS = QUERY_CONTEXT + "listDeployments";
     public static final String LIST_K8S_OBJECTS = QUERY_CONTEXT + "listK8sObjects";
+
+    // Bulk Service
+    public static final String BULK_SERVICE = ROOT_CONTEXT + "bulk";
 }
diff --git a/src/main/java/com/autotune/utils/Utils.java b/src/main/java/com/autotune/utils/Utils.java
index 3d65dea4c..1b3b281de 100644
--- a/src/main/java/com/autotune/utils/Utils.java
+++ b/src/main/java/com/autotune/utils/Utils.java
@@ -16,9 +16,12 @@
 package com.autotune.utils;

+import com.autotune.analyzer.adapters.DeviceDetailsAdapter;
+import com.autotune.analyzer.adapters.RecommendationItemAdapter;
 import com.autotune.analyzer.utils.AnalyzerConstants;
 import com.autotune.analyzer.utils.GsonUTCDateAdapter;
 import com.autotune.common.data.result.ContainerData;
+import com.autotune.common.data.system.info.device.DeviceDetails;
 import com.google.gson.ExclusionStrategy;
 import com.google.gson.FieldAttributes;
 import com.google.gson.Gson;
@@ -169,6 +172,8 @@ public static T getClone(T object, Class classMetadata) {
                 .setPrettyPrinting()
                 .enableComplexMapKeySerialization()
                 .registerTypeAdapter(Date.class, new GsonUTCDateAdapter())
+                .registerTypeAdapter(AnalyzerConstants.RecommendationItem.class, new RecommendationItemAdapter())
+                .registerTypeAdapter(DeviceDetails.class, new DeviceDetailsAdapter())
                 .create();

         String serialisedString = gson.toJson(object);
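With `BULK_SERVICE` mounted at `ROOT_CONTEXT + "bulk"`, a client can poll job status using the `job_id` key defined in `KRUIZE_BULK_API`. A hedged client-side sketch follows; the base URL, polling interval, and string-matching of the status values are assumptions for illustration, not Kruize defaults.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class BulkJobPoller {
    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.err.println("usage: BulkJobPoller <job_id>");
            return;
        }
        String base = "http://localhost:8080/bulk"; // assumed host; path is ROOT_CONTEXT + "bulk"
        HttpClient client = HttpClient.newHttpClient();

        while (true) {
            HttpRequest req = HttpRequest.newBuilder(
                    URI.create(base + "?job_id=" + args[0])).GET().build();
            String body = client.send(req, HttpResponse.BodyHandlers.ofString()).body();
            System.out.println(body);
            // COMPLETED and FAILED are the terminal states defined in KRUIZE_BULK_API.
            if (body.contains("\"COMPLETED\"") || body.contains("\"FAILED\"")) {
                break;
            }
            Thread.sleep(5_000); // assumed polling interval
        }
    }
}
```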
diff --git a/tests/test_plans/test_plan_rel_0.0.25.md b/tests/test_plans/test_plan_rel_0.0.25.md
new file mode 100644
index 000000000..312ca0a8c
--- /dev/null
+++ b/tests/test_plans/test_plan_rel_0.0.25.md
@@ -0,0 +1,134 @@
+# KRUIZE TEST PLAN RELEASE 0.0.25
+
+- [INTRODUCTION](#introduction)
+- [FEATURES TO BE TESTED](#features-to-be-tested)
+- [BUG FIXES TO BE TESTED](#bug-fixes-to-be-tested)
+- [TEST ENVIRONMENT](#test-environment)
+- [TEST DELIVERABLES](#test-deliverables)
+  - [New Test Cases Developed](#new-test-cases-developed)
+  - [Regression Testing](#regression-testing)
+- [SCALABILITY TESTING](#scalability-testing)
+- [RELEASE TESTING](#release-testing)
+- [TEST METRICS](#test-metrics)
+- [RISKS AND CONTINGENCIES](#risks-and-contingencies)
+- [APPROVALS](#approvals)
+
+-----
+
+## INTRODUCTION
+
+This document describes the test plan for Kruize remote monitoring release 0.0.25.
+
+----
+
+## FEATURES TO BE TESTED
+
+* Addition of Metric profile json into Kruize manifests
+* Support for Datasource authentication using bearer token
+* Support for Kruize Local Namespace level recommendations
+
+------
+
+## BUG FIXES TO BE TESTED
+
+* Configure openshift port for prometheus service
+
+---
+
+## TEST ENVIRONMENT
+
+* Minikube Cluster
+* Openshift Cluster
+
+---
+
+## TEST DELIVERABLES
+
+### New Test Cases Developed
+
+| # | ISSUE (NEW FEATURE) | TEST DESCRIPTION | TEST DELIVERABLES | RESULTS | COMMENTS |
+|---|---------------------|------------------|-------------------|---------|----------|
+| 1 | Addition of Metric profile json into Kruize manifests | Metric profile json location update in existing tests and demos | | PASSED | |
+| 2 | [Support for Datasource authentication using bearer token](https://github.com/kruize/autotune/pull/1289) | Tested manually | | PASSED | |
+| 3 | Support for Kruize Local Namespace level recommendations [1248](https://github.com/kruize/autotune/pull/1248), [1249](https://github.com/kruize/autotune/pull/1249), [1275](https://github.com/kruize/autotune/pull/1275) | [New tests added](https://github.com/kruize/autotune/pull/1293) | | | |
+| 4 | [Configure openshift port for prometheus service](https://github.com/kruize/autotune/pull/1278) | Updated existing tests to test with the specified datasource service name and namespace | [1291](https://github.com/kruize/autotune/pull/1291) | PASSED | |
+
+### Regression Testing
+
+| # | ISSUE (BUG/NEW FEATURE) | TEST CASE | RESULTS | COMMENTS |
+|---|-------------------------|-----------|---------|----------|
+| 1 | Addition of Metric profile json into Kruize manifests | Kruize local monitoring tests and local monitoring demo | PASSED | |
+| 2 | Configure openshift port for prometheus service | Kruize local monitoring functional tests | PASSED | |
+
+---
+
+## SCALABILITY TESTING
+
+Evaluate Kruize scalability on OCP with 5k experiments, by uploading 15 days of resource usage data and updating recommendations.
+The changes in this release do not have scalability implications; a short scalability test will be run as part of the release testing.
+
+Short scalability run:
+- 5K exps / 15 days of results / 2 containers per exp
+- Kruize replicas - 10
+- OCP - Scalelab cluster
+
+| Kruize Release | Exps / Results / Recos | Execution time | UpdateRecommendations latency (Max / Avg) in s | UpdateResults latency (Max / Avg) in s | LoadResultsByExpName latency (Max / Avg) in s | Postgres DB size (MB) | Kruize Max CPU | Kruize Max Memory (GB) |
+|----------------|------------------------|----------------|------------------------------------------------|----------------------------------------|-----------------------------------------------|-----------------------|----------------|------------------------|
+| 0.0.24_mvp | 5K / 72L / 3L | 4h 04 mins | 0.8 / 0.47 | 0.13 / 0.12 | 0.53 / 0.36 | 21752 | 4.63 | 34.72 |
+| 0.0.25_mvp | 5K / 72L / 3L | 4h 06 mins | 0.8 / 0.47 | 0.14 / 0.12 | 0.52 / 0.36 | 21756 | 4.91 | 30.13 |
+
+----
+## RELEASE TESTING
+
+As part of the release testing, the following tests will be executed:
+- [Kruize Remote monitoring Functional tests](/tests/scripts/remote_monitoring_tests/Remote_monitoring_tests.md)
+- [Fault tolerant test](/tests/scripts/remote_monitoring_tests/fault_tolerant_tests.md)
+- [Stress test](/tests/scripts/remote_monitoring_tests/README.md)
+- [DB Migration test](/tests/scripts/remote_monitoring_tests/db_migration_test.md)
+- [Recommendation and box plot values validation test](https://github.com/kruize/kruize-demos/blob/main/monitoring/remote_monitoring_demo/recommendations_infra_demo/README.md)
+- [Scalability test (On openshift)](/tests/scripts/remote_monitoring_tests/scalability_test.md) - scalability test with 5000 exps / 15 days usage data
+- [Kruize remote monitoring demo (On minikube)](https://github.com/kruize/kruize-demos/blob/main/monitoring/remote_monitoring_demo/README.md)
+- [Kruize local monitoring demo (On openshift)](https://github.com/kruize/kruize-demos/blob/main/monitoring/local_monitoring_demo)
+- [Kruize local monitoring Functional tests](/tests/scripts/local_monitoring_tests/Local_monitoring_tests.md)
+
+| # | TEST SUITE | EXPECTED RESULTS | ACTUAL RESULTS | COMMENTS |
+|---|------------|------------------|----------------|----------|
+| 1 | Kruize Remote monitoring Functional testsuite | TOTAL - 359, PASSED - 316 / FAILED - 43 | TOTAL - 359, PASSED - 316 / FAILED - 43 | Intermittent issue seen [1281](https://github.com/kruize/autotune/issues/1281), existing issues - [559](https://github.com/kruize/autotune/issues/559), [610](https://github.com/kruize/autotune/issues/610) |
+| 2 | Fault tolerant test | PASSED | PASSED | |
+| 3 | Stress test | PASSED | PASSED | |
+| 4 | Scalability test (short run) | | | Exps - 5000, Results - 72000, execution time - 4 hours 6 mins |
+| 5 | DB Migration test | PASSED | PASSED | Tested on Openshift |
+| 6 | Recommendation and box plot values validations | PASSED | PASSED | |
+| 7 | Kruize remote monitoring demo | PASSED | PASSED | Tested manually |
+| 8 | Kruize Local monitoring demo | PASSED | PASSED | |
+| 9 | Kruize Local Functional tests | TOTAL - 78, PASSED - 75 / FAILED - 3 | TOTAL - 78, PASSED - 75 / FAILED - 3 | [Issue 1217](https://github.com/kruize/autotune/issues/1217), [Issue 1273](https://github.com/kruize/autotune/issues/1273) |
+
+---
+
+## TEST METRICS
+
+### Test Completion Criteria
+
+* All must_fix defects identified for the release are fixed
+* New features work as expected and tests have been added to validate them
+* No new regressions in the functional tests
+* All non-functional tests work as expected without major issues
+* Documentation updates have been completed
+
+----
+
+## RISKS AND CONTINGENCIES
+
+* None
+
+----
+## APPROVALS
+
+Sign-off
+
+----