Update observability README + fix typos #556

Merged
merged 3 commits into from Nov 15, 2024
66 changes: 38 additions & 28 deletions kubernetes-addons/Observability/README.md
@@ -40,7 +40,7 @@ kubectl port-forward service/grafana 3000:80

Open your browser and navigate to http://localhost:3000. Log in with "admin" as the username and "prom-operator" as the password.
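For a quick sanity check from the command line before opening the browser, something like the following works (a sketch assuming the port-forward above is still running and the default kube-prometheus-stack credentials):

```
# Check that Grafana answers through the port-forward, and that the default
# admin credentials are accepted
curl -s http://localhost:3000/api/health
curl -s -u admin:prom-operator http://localhost:3000/api/org
```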

## 2. Metric for Gaudi Hardware(v1.16.2)
## 2. Metrics for Gaudi Hardware (v1.16.2)
Collaborator
The 1.1 release is using tgi-gaudi 2.0.6, which is validated with SW stack v1.18. Shall we update/verify the metrics with 1.18?

Contributor Author
@eero-t Nov 14, 2024

IMHO not worth the trouble for that dashboard. I do not see why Habana would change metric names between version upgrades, and there's a much better Gaudi HW panel in Eval repo [1], which is more worth checking.

In the long run, I think it would be better to separate dashboards for different purposes into different repos, instead of duplicating them:

  • Drop Gaudi HW dashboard from here [1],
  • Move PCM one to Eval repo, and
  • Add k8s-specific ones here, which make some assumptions about how deployments are named with the current Helm charts, to allow the user to select from multiple apps running in different namespaces

=> I can do that after v1.1.


[1] Gaudi HW one here is not very good. It's lacking most metrics, does not allow selecting a node or device for them, and as can be seen from its screenshot in the README, the metric legends are awful:

$ cat Dashboard-Gaudi-HW.json  | grep expr | cut -d'"' -f4- | sed 's/",$//'
habanalabs_temperature_onboard
habanalabs_kube_info
habanalabs_memory_free_bytes
habanalabs_power_mW
habanalabs_utilization

vs one in Eval repo:

$ cat gaudi_grafana.json | grep expr | cut -d'"' -f4- | sed 's/",$//'
habanalabs_device_config{instance=\"$node\", UUID=\"$hpu\"}
habanalabs_utilization{instance=\"$node\", UUID=\"$hpu\"}/100
habanalabs_power_mW{instance=\"$node\", UUID=\"$hpu\"}
habanalabs_pcie_receive_throughput{UUID=\"$hpu\", instance=\"$node\"}
habanalabs_pcie_transmit_throughput{UUID=\"$hpu\", instance=\"$node\"}
habanalabs_temperature_onchip{UUID=\"$hpu\", instance=\"$node\"}
habanalabs_temperature_onboard{UUID=\"$hpu\", instance=\"$node\"}
habanalabs_device_config{instance=\"$node\", UUID=\"$hpu\"}
habanalabs_device_config{instance=\"$node\", UUID=\"$hpu\"}
habanalabs_utilization{instance=\"$node\", UUID=\"$hpu\"}/100
habanalabs_temperature_onboard{UUID=\"$hpu\", instance=\"$node\"}
habanalabs_temperature_onchip{UUID=\"$hpu\", instance=\"$node\"}
habanalabs_clock_soc_mhz{UUID=\"$hpu\", instance=\"$node\"}
habanalabs_memory_used_bytes{UUID=\"$hpu\", instance=\"$node\"}
habanalabs_power_mW{instance=\"$node\", UUID=\"$hpu\"}
habanalabs_memory_used_bytes{UUID=\"$hpu\", instance=\"$node\"} / habanalabs_memory_total_bytes{UUID=\"$hpu\", instance=\"$node\"}


To monitor Gaudi hardware metrics, you can use the following steps:

@@ -64,8 +64,6 @@ kubectl apply -f ./habana/metric-exporter-serviceMonitor.yaml

### Step 4: Verify the metrics

The metric endpoints for habana will be a headless service, so we need to get endpoint to verify

```
# To get the metric endpoints, e.g. to get first endpoint to test
habana_metric_url=`kubectl -n monitoring get ep metric-exporter -o jsonpath="{.subsets[].addresses[0].ip}:{.subsets[].ports[0].port}"`
```

@@ -95,58 +93,70 @@ promhttp_metric_handler_requests_total{code="503"} 0
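Beyond hitting the endpoint directly, you can confirm that Prometheus is actually scraping the exporter by querying one of the Gaudi metrics (a sketch assuming the operator's default `prometheus-operated` service in the `monitoring` namespace):

```
# Port-forward the Prometheus instance created by the operator and query one
# of the habanalabs_* metrics exposed by the exporter
kubectl -n monitoring port-forward svc/prometheus-operated 9090:9090 &
curl -s 'http://localhost:9090/api/v1/query?query=habanalabs_utilization'
```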

### Step 5: Import the dashboard into Grafana

Manually import ./habana/Dashboard-Gaudi-HW.json into Grafana
![alt text](image-1.png)
Manually import the [`Dashboard-Gaudi-HW.json`](./habana/Dashboard-Gaudi-HW.json) file into Grafana
![Gaudi HW dashboard](./assets/habana.png)
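If you prefer the command line over the UI, the same import can be done through Grafana's HTTP API (a sketch assuming the port-forward and default credentials from section 1, plus `jq` being available):

```
# Wrap the dashboard JSON into Grafana's import payload and POST it;
# clearing the id avoids clashes with an existing dashboard
payload=$(jq -n --slurpfile d ./habana/Dashboard-Gaudi-HW.json \
  '{dashboard: ($d[0] | .id = null), overwrite: true}')
curl -s -u admin:prom-operator -H "Content-Type: application/json" \
  -d "$payload" http://localhost:3000/api/dashboards/db
```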

## 3. Metric for OPEA/chatqna
## 3. Metrics for OPEA applications

To monitor ChatQnA metrics including TGI-gaudi,TEI,TEI-Reranking and other micro services, you can use the following steps:
To monitor OPEA application metrics including TGI-gaudi, TEI, TEI-Reranking and other microservices, you can use the following steps:

### Step 1: Install ChatQnA by Helm
### Step 1: Install application with Helm

Install Helm (version >= 3.15) first. Refer to the [Helm Installation Guide](https://helm.sh/docs/intro/install/) for more information.
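For example, Helm can be installed with the official installer script:

```
# Install Helm using the official installer script, then verify the version
curl -fsSL https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
helm version
```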

Refer to the [ChatQnA helm chart](https://github.com/opea-project/GenAIInfra/tree/main/helm-charts/chatqna/README.md) for instructions on deploying ChatQnA into Kubernetes on Xeon & Gaudi.
Install OPEA application as described in [Helm charts README](../../helm-charts/README.md).

### Step 2: Install all the serviceMonitor
For example, to install ChatQnA, follow the [ChatQnA helm chart](https://github.com/opea-project/GenAIInfra/tree/main/helm-charts/chatqna/README.md) for instructions on deploying it to Kubernetes.

> NOTE:
> If the chatQnA installed into another instance instead of chatqna(Default instance name),you should modify the
> matchLabels app.kubernetes.io/instance:${instanceName} with proper instanceName
Make sure to enable the [Helm monitoring option](../../helm-charts/monitoring.md).
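As an illustration, a ChatQnA install with monitoring enabled could look like the following; the `global.monitoring` value name is an assumption here, so check the monitoring guide linked above for the authoritative option:

```
# Install ChatQnA from the local chart with the monitoring option enabled
# (global.monitoring is an assumed value name; see ../../helm-charts/monitoring.md)
helm install chatqna ../../helm-charts/chatqna --set global.monitoring=true
```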

```
kubectl apply -f chatqna/
```
### Step 2: Install dashboards

Here are a few Grafana dashboards for monitoring different aspects of OPEA applications:

- [`queue_size_embedding_rerank_tgi.json`](./chatqna/dashboard/queue_size_embedding_rerank_tgi.json): queue size of TGI-gaudi, TEI-Embedding, TEI-reranking
- [`tgi_grafana.json`](./chatqna/dashboard/tgi_grafana.json): `tgi-gaudi` text generation inferencing service utilization
- [`opea-scaling.json`](./opea-apps/opea-scaling.json): scaling, request rates and failures for OPEA application megaservice, TEI-reranking, TEI-embedding, and TGI

### Step 3: Install the dashboard
You can either:

- manually import tgi_grafana.json into the Grafana to monitor the tgi-gaudi utilization
- manually import queue_size_embedding_rerank_tgi.json into the Grafana to monitor the queue size of TGI-gaudi,TEI-Embedding,TEI-reranking
- OR you could create dashboard to monitor all the services in ChatQnA by yourself
- Import them manually into Grafana,
- Use the [`update-dashboards.sh`](./update-dashboards.sh) script to add them to Kubernetes as Grafana dashboard configMaps
  - (The script assumes Prometheus / Grafana are installed according to the instructions above)
- Or create your own dashboards based on them

![alt text](image-2.png)
Note: when a dashboard is imported into Grafana, you can save changes to it directly, but those dashboards go away if Grafana is removed / re-installed.

## 4. Metric for PCM(Intel® Performance Counter Monitor)
With dashboard configMaps, on the other hand, Grafana saves changes to a selected file, and you need to remember to re-apply them to Kubernetes / Grafana for your changes to be there when that dashboard is reloaded.
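As a rough sketch of what the dashboard configMap approach amounts to (assuming the stock kube-prometheus-stack Grafana sidecar, which watches for configMaps carrying the `grafana_dashboard` label):

```
# Add one dashboard as a configMap in the monitoring namespace; the Grafana
# sidecar loads configMaps labeled with grafana_dashboard
kubectl -n monitoring create configmap tgi-dashboard \
  --from-file=tgi_grafana.json=./chatqna/dashboard/tgi_grafana.json
kubectl -n monitoring label configmap tgi-dashboard grafana_dashboard=1
```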

![TGI dashboard](./assets/tgi.png)
![Scaling dashboard](./assets/opea-scaling.png)

## 4. Metrics for PCM (Intel® Performance Counter Monitor)

### Step 1: Install PCM

Please refer this repo to install [Intel® PCM](https://github.com/intel/pcm)
Please refer to this repo to install [Intel® PCM](https://github.com/intel/pcm)

### Step 2: Modify & Install pcm-service

modify the pcm/pcm-service.yaml to set the addresses
Modify the `pcm/pcm-service.yaml` file to set the addresses, then apply it:

```
kubectl apply -f pcm/pcm-service.yaml
```

### Step 3: Install pcm serviceMonitor
### Step 3: Install PCM serviceMonitor

```
kubectl apply -f pcm/pcm-serviceMonitor.yaml
```
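To verify that the serviceMonitor was picked up, you can check Prometheus' active targets (a sketch assuming the same `prometheus-operated` service as in the Gaudi section):

```
# Check that the pcm scrape target shows up among Prometheus' active targets
kubectl -n monitoring port-forward svc/prometheus-operated 9090:9090 &
curl -s http://localhost:9090/api/v1/targets | grep -i pcm
```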

### Step 4: Install the pcm dashboard
### Step 4: Install the PCM dashboard

Manually import the [`pcm-dashboard.json`](./pcm/pcm-dashboard.json) file into Grafana
![PCM dashboard](./assets/pcm.png)

## More dashboards

manually import the pcm/pcm-dashboard.json into the Grafana
![alt text](image.png)
The GenAIEval repository includes additional [dashboards](https://github.com/opea-project/GenAIEval/tree/main/evals/benchmark/grafana).