feat(observability): add instrumented_chat_stream for LiteLLM streaming metrics by joshua0chen · Pull Request #361 · scaleapi/scale-agentex-python

joshua0chen · 2026-05-19T00:15:47Z

Summary: Adds instrumented_chat_stream, an async generator wrapper that instruments a LiteLLM ChatCompletions stream with OTel metrics (TTFT, TTAT, TPS, cached input tokens, reasoning tokens).

Context:
Currently we have two LLM streaming paths:

- OpenAI Responses API: handled by TemporalStreamingModel, which has inline ttft/ttat/tps instrumentation (done in feat(streaming): emit OTel metrics for ttft, tps, token counts #347).
- LiteLLM ChatCompletions: used by custom agents (DUA, etc.) via ChatCmplStreamHandler, with no SDK-level streaming metrics.

For the second path, the existing metrics infrastructure covers token counters but not streaming latency.

Problem:

- LLMMetricsHooks.on_llm_end fires only after the Runner has fully consumed the stream and assembled a final ModelResponse. It records requests, input_tokens, output_tokens, cached_input_tokens, and reasoning_tokens, but it cannot record TTFT/TTAT/TPS because it never sees individual chunks or their arrival times.
- TemporalStreamingModel records ttft/ttat/tps inline during its own stream iteration, but this only applies to the Responses API path. Agents using LiteLLM don't go through this class.

Solution:
This PR introduces a reusable SDK helper, instrumented_chat_stream, for agents that use LiteLLM streaming.

Usage Example (DUA):
https://github.com/scaleapi/agentex-agents/pull/1556

Greptile Summary

This PR introduces instrumented_chat_stream, an async generator wrapper that augments LiteLLM ChatCompletions streaming calls with OTel metrics (TTFT, TTAT, TPS, cached-token, and reasoning-token counts), filling the gap left by LLMMetricsHooks.on_llm_end which cannot observe individual streaming chunks.

- The wrapper interposes on ChatCmplStreamHandler.handle_stream, capturing time.perf_counter() timestamps on each ResponseTextDeltaEvent, ResponseReasoningTextDeltaEvent, and ResponseFunctionCallArgumentsDeltaEvent to derive per-request TTFT, TTAT, and TPS histograms.
- Token-detail counters (cached_input_tokens, reasoning_tokens) are extracted from LiteLLM's _hidden_params dict, a shared-by-reference object that stream_chunk_builder populates after the stream ends, rather than from the assembled ModelResponse, where LiteLLM strips these fields.
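
As a rough illustration of the timestamp bookkeeping described above, here is a minimal, self-contained sketch. It is not the PR's implementation: `timed_stream`, the dict-shaped events, and the `record` callback are hypothetical stand-ins for the real event classes and OTel histograms. It also assumes that TTAT means time to the first non-reasoning delta and that TPS is output tokens divided by the window between the first and last token; the page does not define either.

```python
import asyncio
import time
from typing import AsyncIterator, Callable, Optional

TOKEN_TYPES = ("text", "reasoning", "fn_args")  # hypothetical event kinds


async def timed_stream(
    events: AsyncIterator[dict],
    record: Callable[[str, float], None],
) -> AsyncIterator[dict]:
    """Yield events unchanged while bookmarking token arrival times."""
    stream_start = time.perf_counter()
    first_token_at: Optional[float] = None   # first delta of any kind
    first_answer_at: Optional[float] = None  # assumption: first non-reasoning delta
    last_token_at: Optional[float] = None
    output_tokens = 0
    try:
        async for event in events:
            now = time.perf_counter()
            if event["type"] in TOKEN_TYPES:
                if first_token_at is None:
                    first_token_at = now
                if first_answer_at is None and event["type"] != "reasoning":
                    first_answer_at = now
                last_token_at = now
            elif event["type"] == "completed":
                output_tokens = event["output_tokens"]
            yield event
    finally:
        # Metrics are emitted once, after iteration ends or is abandoned.
        if first_token_at is not None:
            record("ttft_ms", (first_token_at - stream_start) * 1000)
        if first_answer_at is not None:
            record("ttat_ms", (first_answer_at - stream_start) * 1000)
        if first_token_at is not None and last_token_at is not None:
            window = last_token_at - first_token_at
            if window > 0 and output_tokens > 0:
                record("tps", output_tokens / window)  # assumed TPS definition


async def fake_events() -> AsyncIterator[dict]:
    """Stand-in for a handler's event stream, with small artificial delays."""
    for ev in (
        {"type": "reasoning"},
        {"type": "text"},
        {"type": "text"},
        {"type": "completed", "output_tokens": 3},
    ):
        await asyncio.sleep(0.01)
        yield ev
```

Recording in a `finally` block means the latency metrics are still emitted if the consumer stops iterating early or the stream raises partway through.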

Confidence Score: 4/5

Safe to merge — the new file is additive and only fires when callers explicitly wrap their stream with it, leaving all existing paths unchanged.

The core instrumentation logic is sound: timing bookmarks are captured in the right order, _hidden_params is accessed after iteration completes, and the fallback chain for token-detail extraction is well-documented. Two concerns: the model_name label must match what agent.model produces or latency and request/token metrics will diverge in dashboards; and the no-double-count guarantee for cached/reasoning tokens relies on an undocumented LiteLLM internal behavior that could change across upgrades.

src/agentex/lib/core/observability/instrumented_chat_stream.py — verify the model_name convention matches agent.model and consider adding a comment or test to protect the no-double-count invariant.

Important Files Changed

| Filename | Overview |
| --- | --- |
| src/agentex/lib/core/observability/instrumented_chat_stream.py | New async generator wrapper for LiteLLM streaming metrics. Well-structured with careful LiteLLM internals handling, but callers must ensure `model_name` matches `agent.model` or timing and request metrics will be unjoined in dashboards. Potential future double-counting of cached/reasoning tokens if LiteLLM behavior changes. |

Sequence Diagram

sequenceDiagram
    participant Agent
    participant instrumented_chat_stream
    participant _usage_capturing_stream
    participant ChatCmplStreamHandler
    participant LiteLLM raw_stream

    Agent->>instrumented_chat_stream: async for event in ...
    Note over instrumented_chat_stream: stream_start = perf_counter()
    instrumented_chat_stream->>ChatCmplStreamHandler: handle_stream(response, _usage_capturing_stream())
    loop Each chunk
        ChatCmplStreamHandler->>_usage_capturing_stream: __anext__()
        _usage_capturing_stream->>LiteLLM raw_stream: __anext__()
        LiteLLM raw_stream-->>_usage_capturing_stream: chunk (usage + _hidden_params captured)
        _usage_capturing_stream-->>ChatCmplStreamHandler: chunk
        ChatCmplStreamHandler-->>instrumented_chat_stream: TResponseStreamEvent
        alt TOKEN event (Text/Reasoning/FnCall delta)
            Note over instrumented_chat_stream: first_token_at / first_answer_at / last_token_at updated
        else ResponseCompletedEvent
            Note over instrumented_chat_stream: output_tokens_count captured
        end
        instrumented_chat_stream-->>Agent: yield event (unchanged)
    end
    Note over instrumented_chat_stream: finally block runs
    Note over instrumented_chat_stream: record ttft_ms, ttat_ms, tps
    Note over instrumented_chat_stream: extract cached/reasoning tokens from _hidden_params["usage"]
    Note over instrumented_chat_stream: record cached_input_tokens, reasoning_tokens
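
The `_usage_capturing_stream` step in the diagram can be pictured as a pass-through generator that keeps a reference to each chunk's usage and hidden-params dict. The sketch below is an assumption-laden illustration, not the PR's code: `UsageCapture`, `capture_usage`, and the fake chunks are hypothetical, and the late write to the shared dict only imitates the post-stream population that the summary attributes to `stream_chunk_builder`.

```python
import asyncio
from types import SimpleNamespace
from typing import Any, AsyncIterator, Dict, Optional


class UsageCapture:
    """Holds references to the latest usage object and hidden-params dict."""

    def __init__(self) -> None:
        self.usage: Optional[Any] = None
        self.hidden_params: Optional[Dict[str, Any]] = None


async def capture_usage(
    raw_stream: AsyncIterator[Any], cap: UsageCapture
) -> AsyncIterator[Any]:
    """Forward chunks unchanged, remembering usage and hidden params."""
    async for chunk in raw_stream:
        usage = getattr(chunk, "usage", None)
        if usage is not None:
            cap.usage = usage
        hidden = getattr(chunk, "_hidden_params", None)
        if isinstance(hidden, dict):
            # Keep the reference rather than a copy, so keys written to the
            # dict after the stream ends remain visible through `cap`.
            cap.hidden_params = hidden
        yield chunk


async def demo():
    shared: Dict[str, Any] = {}  # imitates a dict shared by every chunk

    async def fake_raw() -> AsyncIterator[Any]:
        for _ in range(2):
            yield SimpleNamespace(usage=None, _hidden_params=shared)
        # Imitates usage being filled in only once the stream is exhausted.
        shared["usage"] = SimpleNamespace(prompt_tokens=10)

    cap = UsageCapture()
    chunks = [c async for c in capture_usage(fake_raw(), cap)]
    return cap, chunks
```

Holding the dict by reference is what allows the wrapper to read `usage` after iteration completes, even though no individual chunk carried it.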

Prompt To Fix All With AI

Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 2
src/agentex/lib/core/observability/instrumented_chat_stream.py:54-58
**`model_name` vs `agent.model` metric attribute mismatch**

`instrumented_chat_stream` records TTFT/TTAT/TPS under `{"model": model_name}`, while `LLMMetricsHooks.on_llm_end` records requests and token counters under `{"model": str(agent.model) if agent.model else "unknown"}`. If callers pass the model identifier in a different format than what `agent.model` produces (e.g., `"openai/gpt-4"` vs `"gpt-4"`), the two metric families will have a different `model` label value and cannot be correlated in dashboards. The docstring should specify the expected format — or accept the model from the same source as `agent.model` to guarantee alignment.

### Issue 2 of 2
src/agentex/lib/core/observability/instrumented_chat_stream.py:145-171
**Potential double-counting of `cached_input_tokens` and `reasoning_tokens`**

`LLMMetricsHooks.on_llm_end` also calls `m.cached_input_tokens.add(usage.input_tokens_details.cached_tokens or 0, attrs)` and `m.reasoning_tokens.add(usage.output_tokens_details.reasoning_tokens or 0, attrs)`. The PR's invariant — that LiteLLM strips these detail fields from the assembled `ModelResponse` so `on_llm_end` will always see `None` and skip them — is not codified in the code or tests. If a future LiteLLM version or a different provider does populate `input_tokens_details` in the assembled response, both paths will `.add()` to the same OTel counter for the same request, doubling the values. Adding a brief inline comment (or an integration test assertion) that documents why double-counting cannot occur would make this assumption explicit and catch regressions early.

Reviews (1): Last reviewed commit: "instrumented chat stream for observabili..."

Greptile also left 2 inline comments on this PR.

github-actions · 2026-05-19T00:16:00Z

This PR is targeting main, but PRs should target the next branch by default.

The main branch is reserved for release-please and Stainless automation. To resolve, pick one of:

Re-target the PR to next (recommended). On the PR page, click Edit next to the title and change the base branch to next.
Add the target-main label if this is an intentional exception (e.g. an urgent hotfix). The check will re-run and pass.

See CONTRIBUTING.md for the full branch model.

greptile-apps · 2026-05-19T00:20:55Z

+async def instrumented_chat_stream(
+    raw_stream: AsyncIterator,
+    response: Response,
+    model_name: str,
+) -> AsyncIterator[TResponseStreamEvent]:


model_name vs agent.model metric attribute mismatch

instrumented_chat_stream records TTFT/TTAT/TPS under {"model": model_name}, while LLMMetricsHooks.on_llm_end records requests and token counters under {"model": str(agent.model) if agent.model else "unknown"}. If callers pass the model identifier in a different format than what agent.model produces (e.g., "openai/gpt-4" vs "gpt-4"), the two metric families will have a different model label value and cannot be correlated in dashboards. The docstring should specify the expected format — or accept the model from the same source as agent.model to guarantee alignment.
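
One possible way to keep the two label values aligned is to derive both from a single helper. The sketch below is only an illustration: `model_label` and `metric_attrs` are hypothetical names, and the expression inside `model_label` simply mirrors the one quoted above for `LLMMetricsHooks.on_llm_end`.

```python
from typing import Any, Dict, Optional


def model_label(model: Optional[Any]) -> str:
    """Mirror the quoted expression: str(agent.model) if agent.model else "unknown"."""
    return str(model) if model else "unknown"


def metric_attrs(model: Optional[Any]) -> Dict[str, str]:
    """Build the attribute dict that both metric families would share."""
    return {"model": model_label(model)}


# Two spellings of the same model produce different label values, which is
# the mismatch the comment warns about.
hook_attrs = metric_attrs("gpt-4")
stream_attrs = metric_attrs("openai/gpt-4")
labels_match = hook_attrs == stream_attrs
```

If the caller passed `agent.model` itself rather than a separately formatted string, both paths would go through the same conversion and the labels would match by construction.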


greptile-apps · 2026-05-19T00:20:56Z

+            # --- Token detail counters -------------------------------------------
+            # Prefer _hidden_params["usage"] (reconstructed by stream_chunk_builder
+            # with all detail fields) over raw per-chunk usage.
+            if _last_hidden_params is not None:
+                hp_usage = _last_hidden_params.get("usage")
+                if hp_usage is not None:
+                    raw_usage = hp_usage
+
+            cached_tokens = 0
+            reasoning_tokens = 0
+            if raw_usage is not None:
+                # prompt_tokens_details.cached_tokens (standard OpenAI field)
+                ptd = getattr(raw_usage, "prompt_tokens_details", None)
+                if ptd is not None:
+                    cached_tokens = getattr(ptd, "cached_tokens", 0) or 0
+                # Fallback: LiteLLM PrivateAttr _cache_read_input_tokens
+                if not cached_tokens:
+                    cached_tokens = getattr(raw_usage, "_cache_read_input_tokens", 0) or 0
+
+                ctd = getattr(raw_usage, "completion_tokens_details", None)
+                if ctd is not None:
+                    reasoning_tokens = getattr(ctd, "reasoning_tokens", 0) or 0
+
+            if cached_tokens > 0:
+                m.cached_input_tokens.add(cached_tokens, attrs)
+            if reasoning_tokens > 0:
+                m.reasoning_tokens.add(reasoning_tokens, attrs)


Potential double-counting of cached_input_tokens and reasoning_tokens

LLMMetricsHooks.on_llm_end also calls m.cached_input_tokens.add(usage.input_tokens_details.cached_tokens or 0, attrs) and m.reasoning_tokens.add(usage.output_tokens_details.reasoning_tokens or 0, attrs). The PR's invariant — that LiteLLM strips these detail fields from the assembled ModelResponse so on_llm_end will always see None and skip them — is not codified in the code or tests. If a future LiteLLM version or a different provider does populate input_tokens_details in the assembled response, both paths will .add() to the same OTel counter for the same request, doubling the values. Adding a brief inline comment (or an integration test assertion) that documents why double-counting cannot occur would make this assumption explicit and catch regressions early.
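
The suggested regression check could look roughly like the following. This is a hedged sketch: `hook_would_count_details` is a hypothetical helper, and the `SimpleNamespace` objects stand in for the assembled usage that the hook receives. Only the field names `input_tokens_details.cached_tokens` and `output_tokens_details.reasoning_tokens` come from the comment above.

```python
from types import SimpleNamespace
from typing import Any


def hook_would_count_details(usage: Any) -> bool:
    """Return True if an on_llm_end-style hook would add non-zero detail tokens."""
    itd = getattr(usage, "input_tokens_details", None)
    otd = getattr(usage, "output_tokens_details", None)
    cached = getattr(itd, "cached_tokens", 0) or 0
    reasoning = getattr(otd, "reasoning_tokens", 0) or 0
    return cached > 0 or reasoning > 0


# Assembled usage with detail fields stripped, as the PR assumes today.
stripped = SimpleNamespace(
    input_tokens=10,
    output_tokens=5,
    input_tokens_details=None,
    output_tokens_details=None,
)

# Assembled usage as it might look if a future version kept the details.
populated = SimpleNamespace(
    input_tokens=10,
    output_tokens=5,
    input_tokens_details=SimpleNamespace(cached_tokens=4),
    output_tokens_details=SimpleNamespace(reasoning_tokens=2),
)
```

An integration test could assert that this check is false for a real assembled response, so that a change in upstream behavior fails the test rather than silently doubling the counters.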


instrumented chat stream for observability

b91f6d6

greptile-apps Bot reviewed May 19, 2026

joshua0chen wants to merge 1 commit into main from jchen/instrumented-chat-stream
