$ cat ai-engineering/streaming.md

Streaming and Backpressure

Token-by-token LLM streaming end to end: SSE vs WebSockets, partial JSON parsing, cancellation with AbortController, and where backpressure actually bites.

Jatin Bansal@blog:~/ai-engineering$ open streaming

A support chatbot ships at 09:00. By 10:30 the on-call has a flood of complaints: “the bot froze on me.” The dashboards say latency p50 is 4.8s, p99 is 12s — well within the budget on paper. The bug isn’t latency. The bot is not streaming. Users see a spinner for five seconds, then a wall of text. The same five seconds with token-by-token streaming would have read as fast even though the total wall time is identical. Streaming is rarely about throughput; it is about time-to-first-token (TTFT) becoming visible to the user, and about giving the user a cancel button that actually works.

Opening bridge

Yesterday’s piece on function calling and tool use covered the call/result loop as typed RPC. It mostly assumed non-streaming responses: send a request, wait for the full assistant message, dispatch tools, reply. That works for one-shot extractions, but it falls apart the moment a user is on the other end of a chat box, the moment you want to render a partial UI as a structured output is being decoded, or the moment a tool result starts arriving in chunks. Today is about the transport layer underneath every chat product you’ve used: how providers push tokens incrementally, how you parse partial JSON for tool calls, how cancellation actually flows from a “Stop” button down to the GPU, and where backpressure quietly breaks production.

What “streaming” actually is

Streaming, at the API level, is the provider pushing the decode loop’s output incrementally over a single open HTTP response instead of buffering the full message and returning it on close. Every major provider — Anthropic, OpenAI, Google, Mistral, the OSS serving stacks behind vLLM/TGI — exposes the same primitive: set stream=true (or call a dedicated stream method), receive a sequence of server-sent events (SSE), each carrying a typed event name and a JSON payload, until a terminal event closes the stream.

Streaming gets you four operational properties at once:

TTFT becomes the latency that matters. Total wall time is unchanged; perceived latency drops to the time-to-first-token. On a 1,000-token answer at ~80 tokens/sec, that’s 12.5s down to ~200ms, roughly a 60× reduction in perceived latency for the same compute.
Cancellation is cheap. Closing the HTTP connection mid-stream interrupts decoding on the provider side; the GPU stops producing tokens you’ll be billed for. Without streaming, the request typically runs until the model stops or hits max_tokens, even if the user closed the tab.
Progressive rendering. UI can paint as tokens arrive — including tool-call arguments as they’re being decoded, which lets you show “checking inventory…” before the model finishes emitting the check_inventory tool call.
Bounded memory. The server never holds the full response in memory; each chunk is forwarded to the client and discarded. For a 100k-token response (roughly 400KB of text at ~4 bytes per token) this is the difference between buffering the whole payload per request and holding a fixed ~64KB window.
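
To make the perceived-latency arithmetic concrete, here is a minimal sketch. The function name and the inputs (1,000 tokens, 80 tokens/sec, 200 ms TTFT) are illustrative assumptions, not measurements from any provider.

```python
def perceived_wait(tokens: int, tokens_per_sec: float, ttft_sec: float) -> dict:
    """Compare how long the user waits for visible output with and without streaming."""
    decode_sec = tokens / tokens_per_sec  # time to decode the full answer
    return {
        "buffered_wait_s": decode_sec,    # nothing is visible until the end
        "streamed_wait_s": ttft_sec,      # the first token is visible at TTFT
        "ratio": decode_sec / ttft_sec,   # how much sooner the user sees output
    }


# Illustrative inputs only: 1,000 tokens at 80 tokens/sec with a 200 ms TTFT.
example = perceived_wait(tokens=1_000, tokens_per_sec=80, ttft_sec=0.2)
```

The total compute is the same in both cases; only the moment at which the user first sees output moves.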

The thing streaming does not get you is faster throughput. The model still decodes one token per step; total tokens-per-second is unchanged. If your problem is “the answer takes 12s and I have 8s,” streaming is the wrong layer to attack — look at model size, prompt caching, or latency optimization instead.

Intuition: producer/consumer with the network in the middle

Three actors, one pipe. The GPU is the producer — it emits a token every 10–20ms during decode. The client is the consumer — its rendering loop wants those tokens as soon as they exist. In between sit your provider’s API, your reverse proxy, your CDN, and your application backend if you’re proxying, each of which can buffer, batch, or drop chunks if misconfigured.

The implicit contract: every actor in the chain forwards each chunk immediately and applies no buffering beyond what TCP requires. The implicit failure mode: somebody in the chain buffers. The most common culprits are gzip compression layers that wait for a window of bytes before flushing, Nginx’s default proxy_buffering on, which holds the upstream response until its buffers fill or the upstream closes, and CDN edges that don’t recognize the text/event-stream MIME type. When you see “the API streams, but my app waits for the full message,” the bug is almost always in the proxy chain, not the SDK.

The distributed-systems parallel

Streaming an LLM response is the HTTP/2 long-lived response version of a producer/consumer queue with a single in-flight item — the flow control primitives apply, but the queue depth is one. The closer parallel is the Unix pipe: one writer, one reader, OS-level flow control, both sides die when the connection closes. The semantic match is almost exact — SSE is a one-way pipe with framing, automatic reconnection, and a Last-Event-ID checkpoint.

Backpressure is the term from reactive systems for what happens when the consumer can’t keep up with the producer. In LLM streaming the producer rate is capped (the GPU produces tokens at a fixed rate per stream) and the network path is the bottleneck, so backpressure manifests as TCP receive-window shrinkage, which causes the provider to pause decoding. The headline failure mode isn’t dropped tokens (TCP guarantees delivery); it’s the GPU sitting idle while waiting for the client’s window to drain, which on a shared inference cluster means your slot is wasted. Once you proxy through poorly tuned middleware, most production streams are I/O-bound on the client path, not on the model.

The deeper parallel is iterator vs collection — the same trade-off Java’s Stream<T> vs List<T> makes, the same one AsyncIterable vs Promise<T[]> makes in JavaScript. Streaming chooses the iterator: each consumer must handle one chunk at a time, no random access, no .length, and the only signal of completion is a terminal event. Code that wants to operate on “the full response” must accumulate, defeating the point unless you have a real reason to wait.

Mechanics: SSE event types in practice

Both Anthropic and OpenAI ship over SSE. The wire format is identical — event: <name>\ndata: <json>\n\n separated frames — but the event grammars differ. Memorize the Anthropic vocabulary first; the OpenAI Responses API events map onto the same shape with different names.
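
A minimal sketch of reading that wire format by hand, assuming the input is an iterable of already-decoded text lines. The parse_sse helper is hypothetical; real SDKs handle more of the SSE spec, such as id: and retry: fields.

```python
import json
from typing import Iterable, Iterator, Tuple


def parse_sse(lines: Iterable[str]) -> Iterator[Tuple[str, dict]]:
    """Yield (event_name, payload) per frame; a blank line terminates a frame."""
    event, data = "message", []          # "message" is the SSE default event name
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            if data:                     # dispatch only frames that carried data
                yield event, json.loads("\n".join(data))
            event, data = "message", []
        elif line.startswith(":"):
            continue                     # comment line, often used as a keep-alive
        elif line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            # the spec strips a single leading space after the colon
            data.append(line[len("data:"):].removeprefix(" "))


wire = [
    "event: content_block_delta\n",
    'data: {"index": 0}\n',
    "\n",
    ": keep-alive comment\n",
    "event: message_stop\n",
    "data: {}\n",
    "\n",
]
frames = list(parse_sse(wire))
```

Everything above the JSON payload is plain text framing, which is why the same reader works for either provider; only the event names and payload shapes change.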

Anthropic’s Messages API streaming events

A single streaming response is bracketed by message_start and message_stop. Inside, each content block (a text block, a tool_use block, a thinking block) is bracketed by content_block_start / content_block_stop, with a sequence of content_block_delta events carrying the incremental payload. The delta type tells you what’s inside:

text_delta — append .text to your text buffer.
input_json_delta — append .partial_json to a buffer indexed by the block index; this is the streaming form of a tool call’s input argument.
thinking_delta — append .thinking for extended thinking blocks; treat as a separate buffer.
signature_delta — the cryptographic signature for the thinking block, only emitted in extended-thinking mode.

message_delta carries top-level changes (final stop_reason, final usage). ping is a keep-alive — ignore it. error is fatal — surface it to the client and tear down.

The mental model: text and tool-call arguments are streamed in parallel. A single assistant turn that contains a text preamble and two tool calls will interleave three delta streams (one text, two input_json_delta), plus a fourth if extended thinking is on, all under the same message_start. Your accumulator must be indexed by (content_block_index, delta_type).
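
A sketch of such an accumulator over plain dictionaries shaped like the events described above. The sample events and the tool id values are made up for illustration; only the event and delta type names come from the text.

```python
import json


def accumulate(events: list) -> dict:
    """Fold streaming events into one buffer per content block index."""
    blocks = {}
    for ev in events:
        kind = ev["type"]
        if kind == "content_block_start":
            cb = ev["content_block"]
            blocks[ev["index"]] = {"type": cb["type"], "name": cb.get("name"),
                                   "buf": "", "done": False}
        elif kind == "content_block_delta":
            delta = ev["delta"]
            if delta["type"] == "text_delta":
                blocks[ev["index"]]["buf"] += delta["text"]
            elif delta["type"] == "input_json_delta":
                blocks[ev["index"]]["buf"] += delta["partial_json"]
        elif kind == "content_block_stop":
            blocks[ev["index"]]["done"] = True   # buffer is now safe to parse
    return blocks


# Hypothetical event sequence: one text block and two tool calls, deltas mixed.
sample = [
    {"type": "content_block_start", "index": 0, "content_block": {"type": "text"}},
    {"type": "content_block_start", "index": 1,
     "content_block": {"type": "tool_use", "id": "tool_a", "name": "get_weather"}},
    {"type": "content_block_delta", "index": 0,
     "delta": {"type": "text_delta", "text": "Checking "}},
    {"type": "content_block_delta", "index": 1,
     "delta": {"type": "input_json_delta", "partial_json": '{"location": "Pa'}},
    {"type": "content_block_delta", "index": 0,
     "delta": {"type": "text_delta", "text": "the weather."}},
    {"type": "content_block_delta", "index": 1,
     "delta": {"type": "input_json_delta", "partial_json": 'ris"}'}},
    {"type": "content_block_stop", "index": 0},
    {"type": "content_block_stop", "index": 1},
]
blocks = accumulate(sample)
tool_input = json.loads(blocks[1]["buf"])
```

Because each buffer is keyed by index, the order in which deltas for different blocks arrive does not matter.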

OpenAI’s Responses API events

The Responses API emits semantic events with the prefix response.*. The shape is similar: response.created opens the stream, response.output_text.delta carries text chunks (you concatenate .delta strings), response.function_call_arguments.delta carries streaming tool-call arguments, and response.completed closes. Tool calls also get lifecycle events that the older Chat Completions streaming chunks did not: response.function_call_arguments.done marks the end of a call’s arguments, and built-in tools emit their own progress events (for example response.web_search_call.in_progress and .completed). Reach for the Responses API when you’re starting fresh; the Chat Completions streaming format is still supported but is the older shape.

Why SSE and not WebSockets

The dominant question every team asks once: “shouldn’t this be a WebSocket?” The answer is almost always no. SSE is unidirectional server-to-client — exactly what LLM streaming needs — and rides on a plain HTTP response. That gives you:

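Standard load balancers, CDNs, and reverse proxies route SSE as an ordinary HTTP response; the one thing to get right is response buffering, covered under the gotchas later in this piece. WebSockets need explicit Upgrade: websocket handling at every hop, and often sticky sessions as well.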
Automatic reconnect via Last-Event-ID is part of the SSE spec; reconnect logic on WebSockets is the caller’s problem.
No protocol upgrade — the connection is HTTP throughout, observable in the same logs and traces as the rest of your stack.

WebSockets win when both sides need to send data on the same connection at high frequency: voice-mode interrupts mid-generation, collaborative agents, anything that needs full-duplex. For 95% of chat products and tool-using agents, SSE is the right transport and adding WebSockets is unnecessary complexity. The 2026 industry shift toward WebSockets for interactive AI is real but narrow: it’s bidirectional voice and real-time interrupt protocols, not chat. See Hivenet’s overview of SSE vs WebSockets for LLM apps for a careful breakdown.

Code: Python with the Anthropic SDK

Stream a message, accumulate text and a tool call in parallel, handle abort. Install: pip install anthropic. The Anthropic SDK ships a high-level messages.stream context manager that handles event accumulation for you, but you should know the lower-level shape because it’s what every framework wraps.

python

import json
import signal
from anthropic import Anthropic

client = Anthropic()

TOOLS = [{
    "name": "get_weather",
    "description": "Get current weather for a location.",
    "input_schema": {
        "type": "object",
        "properties": {"location": {"type": "string"}},
        "required": ["location"],
    },
}]

def stream_with_tools(user_msg: str) -> dict:
    text_parts: list[str] = []
    # tool call inputs arrive as a stream of JSON fragments per content block
    tool_inputs: dict[int, str] = {}
    tool_meta: dict[int, dict] = {}

    cancelled = {"flag": False}
    signal.signal(signal.SIGINT, lambda *_: cancelled.__setitem__("flag", True))

    with client.messages.stream(
        model="claude-opus-4-7",
        max_tokens=1024,
        tools=TOOLS,
        messages=[{"role": "user", "content": user_msg}],
    ) as stream:
        for event in stream:
            if cancelled["flag"]:
                stream.close()  # closes the underlying HTTP response
                break

            if event.type == "content_block_start":
                if event.content_block.type == "tool_use":
                    tool_meta[event.index] = {
                        "id": event.content_block.id,
                        "name": event.content_block.name,
                    }
                    tool_inputs[event.index] = ""
            elif event.type == "content_block_delta":
                d = event.delta
                if d.type == "text_delta":
                    text_parts.append(d.text)
                    print(d.text, end="", flush=True)
                elif d.type == "input_json_delta":
                    tool_inputs[event.index] += d.partial_json
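            elif event.type == "content_block_stop":
                # the block's JSON is complete only now; a call with no
                # arguments streams an empty string, which we read as {}
                if event.index in tool_inputs:
                    tool_meta[event.index]["input"] = json.loads(
                        tool_inputs[event.index] or "{}"
                    )
            elif event.type == "message_stop":
                break

    # keep only tool calls whose block closed; a cancelled stream can leave
    # a partial JSON fragment behind that would not parse
    tools_finalized = [
        {"id": tool_meta[i]["id"], "name": tool_meta[i]["name"],
         "input": tool_meta[i]["input"]}
        for i in sorted(tool_meta)
        if "input" in tool_meta[i]
    ]
    return {"text": "".join(text_parts), "tools": tools_finalized}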

Three points worth flagging. First, text and tool-call JSON arrive interleaved under different event.index values — keep them in separate buffers keyed by index. Second, the input_json_delta chunks are not individually parseable JSON; they’re fragments of a single growing JSON string, and you should only json.loads after the matching content_block_stop. If you must render tool arguments incrementally before the block closes (e.g., to show “searching for X…” as the query field is decoded), use a partial JSON parser rather than incremental json.loads, which will raise on every fragment. Third, calling stream.close() actually closes the underlying HTTP socket — the provider sees the disconnect and stops decoding within one or two tokens. Without that close, your KeyboardInterrupt only kills the Python process; on a deployed service you’d keep paying for tokens the user never saw.

Code: TypeScript with the Vercel AI SDK

The Vercel AI SDK’s streamText is the highest-leverage way to do this in a Next.js app: its result converts to a Response you can return straight to the client, exposes an AsyncIterable you can iterate server-side, and resolves accumulated final values when the stream completes. Install: npm install ai @ai-sdk/anthropic @ai-sdk/react zod.

typescript

import { anthropic } from "@ai-sdk/anthropic";
import { convertToModelMessages, streamText, tool } from "ai";
import { z } from "zod";

const getWeather = tool({
  description: "Get current weather for a location.",
  inputSchema: z.object({ location: z.string() }),
  execute: async ({ location }) => ({ location, tempC: 18, conditions: "fog" }),
});

// Next.js route handler: wires the upstream stream straight through to the browser.
export async function POST(req: Request) {
  const { messages } = await req.json();

  const result = streamText({
    model: anthropic("claude-opus-4-7"),
    tools: { getWeather },
    // useChat() posts UI messages; convert them to model messages first.
    // The await is harmless if the helper is synchronous in your SDK version.
    messages: await convertToModelMessages(messages),
    abortSignal: req.signal, // user closes the tab => upstream cancels
  });

  // toUIMessageStreamResponse() emits the SSE shape useChat() expects.
  return result.toUIMessageStreamResponse();
}

On the client, the useChat hook hides the SSE reader. A minimal version:

typescript

"use client";
import { useChat } from "@ai-sdk/react";

export function Chat() {
  const { messages, sendMessage, status, stop } = useChat();

  return (
    <>
      {messages.map((m) => (
        <div key={m.id}>
          <b>{m.role}:</b>
          {m.parts.map((p, i) =>
            p.type === "text" ? (
              <span key={i}>{p.text}</span>
            ) : p.type === "tool-getWeather" ? (
              <code key={i}>{JSON.stringify(p.input ?? {}, null, 2)}</code>
            ) : null,
          )}
        </div>
      ))}
      {status === "streaming" && <button onClick={stop}>Stop</button>}
      <button onClick={() => sendMessage({ text: "weather in SF?" })}>Send</button>
    </>
  );
}

Two implementation details. First, forwarding req.signal to streamText({ abortSignal }) is what makes the “user closed the tab” case actually stop the upstream call. Without it, the disconnect never reaches the provider request and you pay for the full response. Second, useChat’s stop() aborts the in-flight request on the client, which closes the connection and fires req.signal in the route handler, so the abort propagates down the chain; this is the cancellation story working end-to-end. The known sharp edge: very fast aborts (a stop() call within the first few tokens) can trigger onError instead of onAbort because the upstream is still in the connection-establishment phase (vercel/ai#8088); treat both paths the same in your UI logic.

Partial JSON parsing, in one paragraph

The hardest substream is input_json_delta: you’re receiving raw JSON fragments and you want to render the partial object as it grows. Naive incremental parsing — JSON.parse(buffer) after every delta — fails on every fragment until the very last one. The right tool is a permissive parser that closes the open brackets and quotes in flight: partial-json-parser (TypeScript) and partial-json-parser (Python) are the canonical libraries; LangChain’s JsonOutputParser ships an equivalent for its streaming chain APIs. Maintain parser state across calls — re-parsing from byte zero on every chunk turns the rendering loop O(n²) on response length, which gets brutal past 5–10k tokens. The same pattern shows up in structured output when you want to render a generateObject-style payload as the model decodes it; reach for the same library.
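
To show what “closing the open brackets and quotes in flight” means, here is a deliberately naive sketch. The parse_partial and _close helpers are hypothetical, and unlike the libraries named above this version re-scans the buffer on every call, so it illustrates the idea rather than the stateful implementation you would want in production.

```python
import json


def _close(prefix: str) -> str:
    """Append whatever quotes and brackets are still open in a JSON prefix."""
    stack, in_string, escaped = [], False, False
    for ch in prefix:
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch in "{[":
            stack.append("}" if ch == "{" else "]")
        elif ch in "}]" and stack:
            stack.pop()
    if in_string:
        if escaped:                 # drop a dangling backslash before closing
            prefix = prefix[:-1]
        prefix += '"'
    return prefix + "".join(reversed(stack))


def parse_partial(buffer: str):
    """Return the longest parseable view of a growing JSON buffer, or None."""
    for end in range(len(buffer), 0, -1):
        try:
            return json.loads(_close(buffer[:end]))
        except json.JSONDecodeError:
            continue                # back off past a half-written key or literal
    return None
```

For example, parse_partial('{"location": "San Fr') yields a dictionary whose location is the partial string, which is enough to render a “searching for San Fr…” hint while the rest of the argument is still arriving.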

Trade-offs, failure modes, gotchas

Proxy buffering is the silent killer. Nginx defaults to proxy_buffering on, which holds the upstream response until the buffer fills or the upstream closes. The symptom is “the SDK streams, the curl streams, my deployed app doesn’t.” Fix at the proxy: proxy_buffering off, proxy_cache off, and X-Accel-Buffering: no as a response header from your app (Nginx honors it per-response). The same flag exists in different forms for Cloudflare (Cache-Control: no-transform plus disabling Auto Minify on the response), Vercel/Netlify Edge, and most other CDNs. Test on the deployed environment, not localhost — localhost has no proxy.
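
On the application side, the response headers are the part you control. A minimal, framework-agnostic sketch follows; SSE_HEADERS and sse_frame are hypothetical names, and the no-cache directive is a common convention added here as an assumption rather than something the text above prescribes.

```python
import json

# Headers an SSE endpoint can send so intermediaries are less likely to buffer.
SSE_HEADERS = {
    "Content-Type": "text/event-stream",        # lets proxies recognise the stream
    "Cache-Control": "no-cache, no-transform",  # no-cache is an assumed convention
    "X-Accel-Buffering": "no",                  # per-response opt-out for Nginx
}


def sse_frame(event: str, payload: dict) -> bytes:
    """Serialize one event as an SSE frame, terminated by a blank line."""
    return f"event: {event}\ndata: {json.dumps(payload)}\n\n".encode("utf-8")


frame = sse_frame("ping", {})
```

Headers alone do not override a proxy that is configured to buffer regardless, which is why the deployed-environment test still matters.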

Cancellation must propagate. A “Stop” button that just hides the streaming <div> while the upstream call keeps running is the worst kind of bug: the user thinks they cancelled, you pay for the full response, and your cost dashboards diverge from your usage logs. The cancellation chain is: client AbortController → route handler req.signal → SDK abortSignal → fetch to provider → provider closes the GPU stream. Break any link and the cancel is decorative.
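
The difference between a real and a decorative cancel can be sketched with an async generator standing in for the provider. Everything here is a toy under stated assumptions: fake_upstream and relay_until_stop are hypothetical, and no network is involved.

```python
import asyncio


async def fake_upstream(state: dict):
    """Stand-in for a provider stream; records work done and whether it was closed."""
    try:
        for i in range(1000):
            await asyncio.sleep(0)       # pretend to decode one token
            state["produced"] += 1
            yield f"tok{i} "
    finally:
        state["closed"] = True           # runs when the consumer closes the stream


async def relay_until_stop(stop_after: int):
    """Consume tokens, then cancel by closing the stream rather than just ignoring it."""
    state = {"produced": 0, "closed": False}
    seen = []
    stream = fake_upstream(state)
    try:
        async for tok in stream:
            seen.append(tok)
            if len(seen) >= stop_after:  # the user pressed Stop
                break
    finally:
        await stream.aclose()            # propagate the cancel to the producer
    return seen, state
```

Running asyncio.run(relay_until_stop(3)) leaves the producer closed after three tokens. Hiding the output without closing the stream would leave the producer free to keep generating, which is the decorative cancel described above.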

Tool-call deltas are interleaved, not sequential. The model can emit a text preamble and two parallel tool calls in the same assistant turn (Anthropic’s parallel tool use), and the content_block_delta events arrive interleaved under different indices. Your accumulator must be a map keyed by index, not a single flat string. Treating the stream as a single linear text buffer collapses the two tool calls into one corrupted JSON blob.

Backpressure exists, but inverted. TCP receive-window pressure on a slow client will cause the upstream to pause decoding — the GPU sits idle, your billing meter still ticks for the reservation. The signal to watch is server-side throughput per stream dropping below 50–60 tokens/sec on a known-fast model. If you’re proxying through Node.js, the response.write() calls return false when the client buffer is full; you must await drain rather than buffer in memory. The Node.js stream backpressure guide is the right reference. Skipping this is how a slow mobile client OOMs a Node worker handling a 100k-token response.
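
The same discipline, sketched in Python rather than Node: a bounded asyncio.Queue makes the producer wait when the consumer falls behind, which is the role that awaiting drain plays for response.write(). The relay function and its parameters are hypothetical, and the queue stands in for the socket buffer.

```python
import asyncio


async def relay(n_tokens: int = 50, maxsize: int = 4):
    """Forward tokens through a bounded buffer and report the peak buffer depth."""
    queue: asyncio.Queue = asyncio.Queue(maxsize=maxsize)
    peak = 0

    async def producer():
        nonlocal peak
        for i in range(n_tokens):
            await queue.put(f"t{i}")     # suspends while the buffer is full
            peak = max(peak, queue.qsize())
        await queue.put(None)            # sentinel: end of stream

    async def slow_consumer():
        out = []
        while (item := await queue.get()) is not None:
            await asyncio.sleep(0)       # simulate a slow client
            out.append(item)
        return out

    task = asyncio.create_task(producer())
    out = await slow_consumer()
    await task
    return out, peak
```

With maxsize=4 the buffer never holds more than four tokens no matter how long the response is. An unbounded list in the same position grows with the response, which is the out-of-memory scenario described above.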

Don’t accumulate the whole response in memory. Easy to do accidentally: an arr.push(token); return arr.join("") pattern that re-joins on every chunk turns a streaming endpoint into a memory-heavy one. For a 50k-token response with ~2-character tokens, that’s 50,000 joins copying ~2.5 billion characters in total to build a string that is only ~100KB long. Stream out, don’t accumulate, unless you need the full text for downstream processing (and even then accumulate once at the end via the SDK’s finalText or accumulate()).
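
To put a number on the re-join pattern, this sketch counts the characters copied when the full buffer is rebuilt after every chunk, compared with joining once at the end. The function name is hypothetical and the model ignores allocator details; it only counts characters.

```python
def chars_copied_rejoining(n_tokens: int, chars_per_token: int) -> int:
    """Total characters copied if the whole buffer is re-joined after every chunk."""
    copied = 0
    length = 0
    for _ in range(n_tokens):
        length += chars_per_token   # the buffer grows by one token
        copied += length            # each re-join copies the whole buffer again
    return copied


rejoin_cost = chars_copied_rejoining(50_000, 2)   # join after every chunk
single_join_cost = 50_000 * 2                     # join once at the end
```

The cost of re-joining grows with the square of the response length, while a single join at the end grows linearly.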

Refusal mid-stream is a thing. The model can decide to refuse partway through a response — typical pattern is a partial text block followed by a message_delta carrying stop_reason: "refusal" (Anthropic) or a response.refusal.delta (OpenAI’s Responses API). Your UI must handle the case where the stream ends with a refusal after the user has already seen some tokens; don’t render those orphaned tokens as a complete answer.
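
One way to keep orphaned tokens from being rendered as an answer is to decide the turn’s status only when the stream ends. A minimal sketch follows; finalize_turn and its return shape are hypothetical, and the only stop reason taken from the text is "refusal".

```python
from typing import Optional


def finalize_turn(text_so_far: str, stop_reason: Optional[str]) -> dict:
    """Classify a finished stream so the UI knows how to treat the visible text."""
    if stop_reason == "refusal":
        # the user already saw some tokens; mark them as not a complete answer
        return {"status": "refused", "text": text_so_far, "render_as_answer": False}
    if stop_reason is None:
        # the stream ended without a terminal stop reason (drop or cancel)
        return {"status": "interrupted", "text": text_so_far, "render_as_answer": False}
    return {"status": "complete", "text": text_so_far, "render_as_answer": True}


refused = finalize_turn("Sure, the first step is", "refusal")
```

The UI can then grey out or replace the partial text when render_as_answer is false instead of leaving it on screen as if it were final.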

Streaming and strict structured output compose, but read the docs. OpenAI’s response_format: json_schema streams the JSON one fragment at a time; Anthropic’s strict tool use streams the input_json_delta the same way. The fragment-by-fragment validity guarantee is eventually consistent — at any given moment the buffer may not parse, but the final buffer is guaranteed to. Don’t try to validate against the schema until content_block_stop.

Retries on a streaming connection don’t compose naively. The standard SDK retry policy assumes a single buffered response. On a streaming call that drops 8k tokens in, retrying replays the whole prompt from scratch — billed again, with no automatic continuation. For long-form generation you want either a transparent retry only on initial-connection failures (most SDKs do this; check the docs), or an application-level checkpoint that re-prompts with <continue from here> semantics, which is awkward and prone to drift. Most production systems just surface stream errors to the user and let them retry by hand.
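
A sketch of the “retry only on initial-connection failures” policy: retry while nothing has been emitted, and surface the error once any chunk has reached the caller. The wrapper and the open_stream callable are hypothetical, ConnectionError stands in for whatever your client raises, and backoff between attempts is omitted for brevity.

```python
from typing import Callable, Iterable, Iterator


def stream_with_connect_retry(
    open_stream: Callable[[], Iterable[str]], max_attempts: int = 3
) -> Iterator[str]:
    """Retry a stream only if it fails before yielding its first chunk."""
    for attempt in range(1, max_attempts + 1):
        emitted = False
        try:
            for chunk in open_stream():
                emitted = True
                yield chunk
            return                       # stream finished normally
        except ConnectionError:
            # after the first chunk a replay would duplicate output, so give up
            if emitted or attempt == max_attempts:
                raise


calls = {"n": 0}

def flaky_connect():
    """Fails on the first attempt before producing anything, then succeeds."""
    calls["n"] += 1
    if calls["n"] == 1:
        raise ConnectionError("connect failed")
    yield "a"
    yield "b"

recovered = list(stream_with_connect_retry(flaky_connect))
```

A failure after the first chunk is re-raised unchanged, leaving the decision to re-prompt or surface the error to the application.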

What to read next

The Agent Loop: ReAct and Its Descendants — the runtime that uses this article’s streaming transport. Once tool calls stream as input_json_delta fragments and cancellation propagates end to end, the agent loop is the next layer up — stopping conditions, plan-and-execute, and the harness duties around it.
Function Calling and Tool Use — the call/result loop that today’s article rewrote in streaming form. The two together cover the full agent transport surface.
Prompt Caching: Reusing the KV Cache Across Calls — the prefill-side pair of today’s decode-side article. Cached prefill + token streaming gets you the best perceived latency on the wire; the two together cover both phases of the inference call.
Context Engineering: JIT vs AOT Context Loading — once you can stream tool calls, JIT context loading becomes a streaming retrieval loop. The two patterns reinforce each other.