Replace HTTP "inprogress" gauge with megaservice "request_pending" one #864

Merged 4 commits on Nov 13, 2024
13 changes: 1 addition & 12 deletions comps/cores/mega/http_service.py
@@ -35,18 +35,7 @@ def __init__(
self.uvicorn_kwargs = uvicorn_kwargs or {}
self.cors = cors
self._app = self._create_app()

# remove part before '@', used by register_microservice() callers, and
# part after '/', added by MicroService(), to get real service name, and
# convert invalid characters to '_'
suffix = re.sub(r"[^a-zA-Z0-9]", "_", self.title.split("/")[0].split("@")[-1].lower())

instrumentator = Instrumentator(
inprogress_name=f"http_requests_inprogress_{suffix}",
should_instrument_requests_inprogress=True,
inprogress_labels=True,
)
instrumentator.instrument(self._app).expose(self._app)
Instrumentator().instrument(self._app).expose(self._app)

@property
def app(self):
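For context, the instrumentation that remains after this change is the stock prometheus-fastapi-instrumentator setup, which exposes the standard HTTP metrics (`http_requests_total`, `http_request_size_bytes`, ...) on `/metrics`. A minimal standalone sketch of that default setup; the app title and route are illustrative, not OPEA code:

```python
# Minimal sketch: default prometheus-fastapi-instrumentator setup, as kept by this PR.
# It instruments every route and serves the standard HTTP metrics on /metrics,
# without the per-service "inprogress" gauge removed above.
from fastapi import FastAPI
from prometheus_fastapi_instrumentator import Instrumentator

app = FastAPI(title="opea_service@example")  # illustrative title, not a real OPEA service

# Same call as in http_service.py: instrument all routes, then expose /metrics.
Instrumentator().instrument(app).expose(app)


@app.get("/v1/health")  # illustrative endpoint
def health():
    return {"status": "ok"}
```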
17 changes: 15 additions & 2 deletions comps/cores/mega/orchestrator.py
@@ -12,7 +12,7 @@
import aiohttp
import requests
from fastapi.responses import StreamingResponse
from prometheus_client import Histogram
from prometheus_client import Gauge, Histogram
from pydantic import BaseModel

from ..proto.docarray import LLMParams
@@ -33,6 +33,7 @@ class OrchestratorMetrics:
first_token_latency = Histogram("megaservice_first_token_latency", "First token latency (histogram)")
inter_token_latency = Histogram("megaservice_inter_token_latency", "Inter-token latency (histogram)")
request_latency = Histogram("megaservice_request_latency", "Whole request/reply latency (histogram)")
request_pending = Gauge("megaservice_request_pending", "Count of currently pending requests (gauge)")

def __init__(self) -> None:
pass
@@ -48,6 +49,12 @@ def token_update(self, token_start: float, is_first: bool) -> float:
def request_update(self, req_start: float) -> None:
self.request_latency.observe(time.time() - req_start)

def pending_update(self, increase: bool) -> None:
if increase:
self.request_pending.inc()
else:
self.request_pending.dec()


class ServiceOrchestrator(DAG):
"""Manage 1 or N micro services in a DAG through Python API."""
@@ -74,13 +81,15 @@ def flow_to(self, from_service, to_service):
return False

async def schedule(self, initial_inputs: Dict | BaseModel, llm_parameters: LLMParams = LLMParams(), **kwargs):
req_start = time.time()
self.metrics.pending_update(True)

result_dict = {}
runtime_graph = DAG()
runtime_graph.graph = copy.deepcopy(self.graph)
if LOGFLAG:
logger.info(initial_inputs)

req_start = time.time()
timeout = aiohttp.ClientTimeout(total=1000)
async with aiohttp.ClientSession(trust_env=True, timeout=timeout) as session:
pending = {
@@ -146,6 +155,9 @@ def fake_stream(text):
if node not in nodes_to_keep:
runtime_graph.delete_node_if_exists(node)

if not llm_parameters.streaming:
self.metrics.pending_update(False)

return result_dict, runtime_graph

def process_outputs(self, prev_nodes: List, result_dict: Dict) -> Dict:
Expand Down Expand Up @@ -234,6 +246,7 @@ def generate():
token_start = self.metrics.token_update(token_start, is_first)
is_first = False
self.metrics.request_update(req_start)
self.metrics.pending_update(False)

return (
StreamingResponse(self.align_generator(generate(), **kwargs), media_type="text/event-stream"),
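The net effect of the orchestrator changes is that `megaservice_request_pending` is incremented when `schedule()` starts and decremented once the reply is complete: immediately for non-streaming requests, and at end-of-stream for streaming ones. A simplified standalone sketch of that pattern, detached from the real `ServiceOrchestrator` plumbing:

```python
# Simplified sketch of the pending-request gauge pattern introduced above:
# inc() when a request is scheduled, dec() once the reply is complete.
import time

from prometheus_client import Gauge, Histogram

request_pending = Gauge("megaservice_request_pending", "Count of currently pending requests (gauge)")
request_latency = Histogram("megaservice_request_latency", "Whole request/reply latency (histogram)")


def handle_request(tokens, streaming=True):
    req_start = time.time()
    request_pending.inc()

    if not streaming:
        result = list(tokens)      # whole reply produced at once
        request_pending.dec()
        return result

    def generate():                # streaming reply, token by token
        for token in tokens:
            yield token
        request_latency.observe(time.time() - req_start)
        request_pending.dec()      # stream finished: request no longer pending

    return generate()


for chunk in handle_request(["Hello", " world"]):
    print(chunk, end="")
```

`prometheus_client` also offers `Gauge.track_inprogress()`, but a manual `inc()`/`dec()` pair fits better here, presumably because the decrement for streaming requests has to happen inside the generator, after the last token is sent.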
35 changes: 26 additions & 9 deletions comps/cores/telemetry/README.md
@@ -6,21 +6,23 @@ OPEA Comps currently provides telemetry functionalities for metrics and tracing

## Metrics

OPEA microservice metrics are exported in Prometheus format and are divided into two categories: general metrics and specific metrics.
OPEA microservice metrics are exported in Prometheus format under the `/metrics` endpoint.

General metrics, such as `http_requests_total`, `http_request_size_bytes`, are exposed by every microservice endpoint using the [prometheus-fastapi-instrumentator](https://github.com/trallnag/prometheus-fastapi-instrumentator).
They come in several categories:

Specific metrics are the built-in metrics exposed under `/metrics` by each specific microservices such as TGI, vLLM, TEI and others. Both types of the metrics adhere to the Prometheus format.
- HTTP request metrics are exposed by every OPEA microservice using the [prometheus-fastapi-instrumentator](https://github.com/trallnag/prometheus-fastapi-instrumentator)
- Megaservices export additional metrics for application end-to-end load and performance
- Inferencing microservices such as TGI, vLLM, and TEI provide their own metrics

### General Metrics

To access the general metrics of each microservice, you can use `curl` as follows:
They can be fetched with `curl`, for example:

```bash
curl localhost:{port of your service}/metrics
```

Then you will see Prometheus format metrics printed out as follows:
### HTTP Metrics

The metrics output looks like the following:

```yaml
# HELP http_requests_total Total number of requests by method, status and handler.
@@ -37,9 +39,22 @@ http_request_size_bytes_sum{handler="/v1/chatqna"} 128.0
...
```

### Specific Metrics
Most of them are histogram metrics.

### Megaservice E2E Metrics

The application megaservice `ServiceOrchestrator` provides the following metrics:

To access the metrics exposed by each specific microservice, ensure that you check the specific port and your port mapping to reach the `/metrics` endpoint correctly.
- `megaservice_first_token_latency`: time to first token (TTFT)
- `megaservice_inter_token_latency`: inter-token latency (ITL ~ TPOT)
- `megaservice_request_latency`: whole request E2E latency = TTFT + ITL \* tokens
- `megaservice_request_pending`: how many LLM requests are still in progress

The latency metrics are histograms, i.e. each includes a count, a total value, and a set of value buckets.

They are available only for _streaming_ LLM requests; the pending count covers all requests.
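As an illustration (not part of the OPEA code base), the `_sum`/`_count` series that `prometheus_client` exports for each histogram can be turned into an average latency with a few lines of Python; the megaservice port below is a placeholder, adjust it to your deployment:

```python
# Illustrative only: fetch the megaservice /metrics endpoint and report
# average E2E request latency plus the current pending-request count.
import requests
from prometheus_client.parser import text_string_to_metric_families

text = requests.get("http://localhost:8888/metrics").text  # placeholder port
samples = {
    s.name: s.value
    for family in text_string_to_metric_families(text)
    for s in family.samples
}

count = samples.get("megaservice_request_latency_count", 0)
total = samples.get("megaservice_request_latency_sum", 0.0)
pending = samples.get("megaservice_request_pending", 0)

if count:
    print(f"average E2E latency: {total / count:.3f}s over {int(count)} requests")
print(f"requests currently pending: {int(pending)}")
```

For dashboards, the same calculation is typically done in PromQL with `rate()` over the `_sum` and `_count` series.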

### Inferencing Metrics

For example, you can `curl localhost:6006/metrics` to retrieve the TEI embedding metrics, and the output should look as follows:

@@ -66,6 +81,8 @@ te_request_inference_duration_bucket{le="0.000022500000000000005"} 0
te_request_inference_duration_bucket{le="0.00003375000000000001"} 0
```

### Metrics Collection

These metrics can be scraped by the Prometheus server into a time-series database and further visualized using Grafana.

Below are some default metrics endpoints for specific microservices: