Pending requests metric increments and decrements do not match: after the service has been seriously stressed for a while, the metric value can be in the hundreds when it should be zero, i.e. while the service is idling without any pending requests. This makes the pending requests metric worse than useless.
Some kind of queue size metric is essential for service scaling, because with enough requests and multiple accelerated inferencing engines, the megaservice itself becomes a bottleneck and needs to be scaled.
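For illustration, here is a minimal sketch of how this kind of gauge drift typically arises, assuming a prometheus_client Gauge incremented on request entry and decremented only on the normal completion path (metric and function names below are hypothetical, not the actual GenAIComps code):

```python
from prometheus_client import Gauge

# Hypothetical gauge mirroring the pending-requests metric in question.
PENDING = Gauge("megaservice_request_pending",
                "Requests accepted but not yet completed")

async def handle_request(request):
    PENDING.inc()                         # incremented on entry
    result = await run_pipeline(request)  # hypothetical orchestrator call;
                                          # if it raises or is cancelled...
    PENDING.dec()                         # ...this line never runs
    return result
```

Under heavy load with many parallel connections, every request that errors out or is cancelled before reaching the decrement leaves the gauge permanently inflated by one.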
Notes
I did the PR adding this metric initially. At that point I did not notice this issue, possibly because the service was not stressed long enough for it to show up.
I've tried (earlier today) adding pending count decrements in different places in ServiceOrchestrator, but those either did not have any effect on the metric, or made it completely bogus (decreased it way too much).
I also tried just changing the metric to an incoming requests counter, but that does not really offer any significant value over the (already provided) completed requests count. It does not tell how stressed the megaservice is, the way the needed pending requests (= queue size) count would.
=> As I have other pressing OPEA work, I'm just documenting this here.
Reproduce steps
ChatQnA service, with multiple Gaudi-accelerated TGI or vLLM backend instances
Stress ChatQnA for an hour by constantly sending requests from a thousand parallel client connections
Stop sending new requests and wait until all pending ones have been processed
Query the metrics endpoint for the pending requests gauge: curl <svc:port>/metrics | grep pending
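A sketch of how that last check could be automated; the endpoint URL is an assumption, and the gauge is matched by the same "pending" substring as the grep above:

```python
import requests  # third-party HTTP client (pip install requests)

METRICS_URL = "http://localhost:8888/metrics"  # assumed service address

# Fetch the Prometheus exposition text and check the pending-requests gauge.
for line in requests.get(METRICS_URL, timeout=10).text.splitlines():
    if "pending" in line and not line.startswith("#"):
        value = float(line.rsplit(" ", 1)[-1])
        print(line)
        # After all requests have drained, the gauge should read 0;
        # on affected builds it stays in the hundreds instead.
        assert value == 0, f"pending gauge leaked: {value}"
```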
Raw log
No response
Attachments
No response
@Spycsh unless you have some good idea how to fix this issue for (my) merged #864 code, I think it needs to be reverted, and HTTP inprogress metrics used instead, despite their downsides.
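For context, prometheus_client already ships an in-progress helper whose decrement happens in a finally block, so it cannot leak; a minimal sketch with a hypothetical gauge name:

```python
from prometheus_client import Gauge

IN_PROGRESS = Gauge("megaservice_requests_inprogress",
                    "Requests currently being handled")

async def handle_request(request):
    # track_inprogress() calls inc() on entry and dec() in a finally
    # block, so the count is restored even on exception or cancellation.
    with IN_PROGRESS.track_inprogress():
        return await run_pipeline(request)  # hypothetical orchestrator call
```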
@eero-t , I think it may be affected by a race condition in the calling logic, even though the Gauge inc/dec operations themselves are atomic,
...
Do you think simply adding a lock here can solve this issue?
No, as the actual Metric value change is already protected by a Mutex:
The order in which increments and decrements are done between threads, i.e. different requests, does not matter. It matters only within an individual request, and because the increment always comes before the decrement in the request flow, everything is fine as long as every incremented request actually gets a matching decrement.
Btw. the number of extra pending requests seems to match the largest number of parallel requests/connections made to the Megaservice. Does that ring any bells as to where the decrement might be missing?
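That correlation would fit the decrement being skipped when a client disconnects mid-response: if the reply is produced by an async generator and the client goes away, the generator is closed (or the task cancelled) before the normal completion path runs. A hedged sketch of that suspected mechanism and a finally-based fix, using hypothetical names rather than the actual GenAIComps code path:

```python
from prometheus_client import Gauge

PENDING = Gauge("megaservice_request_pending",
                "Requests accepted but not yet completed")

async def stream_answer(request):
    PENDING.inc()
    try:
        async for token in generate_tokens(request):  # hypothetical token source
            yield token
    finally:
        # Runs even on GeneratorExit / asyncio.CancelledError, e.g. when
        # the client drops the connection mid-stream, so no leak.
        PENDING.dec()
```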
Priority
P3-Medium
OS type
Ubuntu
Hardware type
Gaudi2
Installation method
Deploy method
Running nodes
Single Node
What's the version?
latest images / Git HEAD