Building Real-Time AI Streaming Applications

Server-Sent Events, token-by-token streaming, real-time transcription, and live AI features — a complete guide to streaming AI.

Why Streaming Matters for AI Applications

Nobody likes waiting. When a user sends a message to an AI chatbot, they expect to see the response appear token by token, not a blank screen for 10 seconds followed by a wall of text. Streaming fundamentally changes the perceived performance of AI applications. A response that starts appearing in 200ms feels instant, even if the full generation takes 10 seconds. A response that blocks for 10 seconds before showing anything feels broken.

The psychological effect is widely reported: streaming responses are commonly said to feel 2-3x faster than non-streaming responses with identical total latency. The progressive reveal of content keeps users engaged and gives them confidence that the system is working. For chat applications, streaming is essentially a requirement. Every major AI product, from OpenAI's ChatGPT to Anthropic's Claude, streams responses by default.

Streaming also enables new interaction patterns. Users can interrupt and redirect mid-generation. They can read along as the model generates, catching errors early. GreatChat streams text responses in its AI workspace so users see output as it arrives instead of waiting on a loading spinner.

From a technical perspective, streaming means the user waits only for the time-to-first-token (TTFT) rather than for the whole generation, which makes TTFT the single most important latency metric for perceived performance. Even if total generation time is identical, starting the response sooner dramatically improves perceived speed. Intelligent routing helps here too: GreatRouter's ranking engine considers latency as a scoring dimension, preferring faster models when you request streaming on /v1/chat/completions. Note: /v1/auto/route returns complete responses; use the OpenAI-compatible chat endpoint when you need token-by-token streaming.
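To make the TTFT metric concrete, here is a minimal sketch of how it could be measured around any stream of text chunks. It is an illustration only: measure_stream and its clock parameter are hypothetical names, not part of GreatRouter or any SDK, and the chunks are assumed to be plain strings already extracted from the response.

```python
import time
from typing import Callable, Dict, Iterable


def measure_stream(
    chunks: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Dict[str, object]:
    """Consume text chunks and record time-to-first-token and total time.

    Hypothetical helper for illustration; `clock` is injectable so the
    timing logic can be tested without real delays.
    """
    start = clock()
    first_token_at = None
    parts = []
    for chunk in chunks:
        if not chunk:
            continue  # empty chunks are not visible output, so they do not count
        if first_token_at is None:
            first_token_at = clock()  # first visible text has arrived
        parts.append(chunk)
    end = clock()
    return {
        "ttft": None if first_token_at is None else first_token_at - start,
        "total": end - start,
        "text": "".join(parts),
    }
```

Passing the generator of content deltas from a streaming response gives the two numbers discussed above: the delay before the first visible text and the total generation time.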

Implementing SSE Streaming with GreatRouter

Server-Sent Events (SSE) is the standard protocol for streaming AI responses. It's simpler than WebSockets (unidirectional, server to client), works over standard HTTP, and is natively supported by browsers via the EventSource API. EventSource can only issue GET requests, however, so a POST endpoint such as chat completions is usually read with fetch and a stream reader instead. GreatRouter supports SSE streaming on /v1/chat/completions: add "stream": true and use model: "router" for automatic routing. Here's how to consume a streaming response in JavaScript:
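// Assumes a runtime with fetch and top-level await (for example Node.js 18+ in an ES module).
const response = await fetch("https://api.greatrouterai.com/v1/chat/completions", {
  method: "POST",
  headers: {
    "Authorization": "Bearer YOUR_API_KEY",
    "Content-Type": "application/json"
  },
  body: JSON.stringify({
    model: "router",
    messages: [{ role: "user", content: "Tell me a story about a robot" }],
    stream: true
  })
});
if (!response.ok) throw new Error(`Request failed with status ${response.status}`);

const reader = response.body.getReader();
const decoder = new TextDecoder();
let buffer = "";      // holds a partial line when an event is split across chunks
let finished = false;

while (!finished) {
  const { done, value } = await reader.read();
  if (done) break;

  // stream: true keeps multi-byte characters intact across chunk boundaries
  buffer += decoder.decode(value, { stream: true });
  const lines = buffer.split("\n");
  buffer = lines.pop(); // the last element is an incomplete line (or "")

  for (const line of lines) {
    if (!line.startsWith("data: ")) continue;
    const payload = line.slice(6).trim();
    if (payload === "[DONE]") { // sentinel marking the end of the stream
      finished = true;
      break;
    }
    const data = JSON.parse(payload);
    const content = data.choices?.[0]?.delta?.content;
    if (content) process.stdout.write(content);
  }
}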
In Python with the OpenAI SDK (which handles SSE parsing automatically):
from openai import OpenAI

client = OpenAI(
    base_url="https://api.greatrouterai.com/v1",
    api_key="YOUR_API_KEY"
)

stream = client.chat.completions.create(
    model="router",
    messages=[{"role": "user", "content": "Tell me a story about a robot"}],
    stream=True
)

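for chunk in stream:
    # Some chunks may carry no choices, so check before indexing.
    if chunk.choices and chunk.choices[0].delta.content:
        # flush=True makes each piece appear immediately instead of waiting for a newline
        print(chunk.choices[0].delta.content, end="", flush=True)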
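The key implementation details:

Always handle the [DONE] sentinel that marks stream completion. It is not JSON, so check for it before parsing the payload.

Buffer incomplete lines. A network chunk can end in the middle of an event, so parse only complete lines and carry the remainder over to the next read.

Implement reconnection logic for dropped connections. The EventSource API reconnects automatically and sends a Last-Event-ID header, but a fetch-based reader does not, so retries have to be written by hand. Whether a dropped generation can be resumed rather than restarted depends on the server.

Consider using a request timeout slightly longer than your maximum expected generation time.

GreatRouter automatically handles provider-level streaming. If the selected model supports streaming, the response is streamed through transparently. If it doesn't, the router buffers the complete response and sends it as a single event.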
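When the OpenAI SDK is not available, the [DONE] handling and line buffering described above can be done by hand. The following is a minimal sketch under stated assumptions: SSEParser is a hypothetical helper, not part of any library, it accepts already-decoded text, and it assumes the OpenAI-style chunk shape (choices[0].delta.content) used elsewhere in this guide.

```python
import json
from typing import List


class SSEParser:
    """Incrementally parse SSE text into content deltas (illustrative helper)."""

    def __init__(self) -> None:
        self._buffer = ""   # text received so far that does not yet form a full line
        self.done = False   # set once the [DONE] sentinel has been seen

    def feed(self, text: str) -> List[str]:
        """Add newly received text and return the content deltas it completed."""
        self._buffer += text
        deltas: List[str] = []
        while "\n" in self._buffer and not self.done:
            line, self._buffer = self._buffer.split("\n", 1)
            line = line.rstrip("\r")
            if not line.startswith("data:"):
                continue  # blank separator lines, comments, and other fields
            payload = line[len("data:"):].strip()
            if payload == "[DONE]":
                self.done = True  # the sentinel is not JSON, so stop before parsing
                break
            event = json.loads(payload)
            choices = event.get("choices") or []
            if choices:
                content = (choices[0].get("delta") or {}).get("content")
                if content:
                    deltas.append(content)
        return deltas
```

A caller would decode each network chunk to text, pass it to feed, print whatever comes back, and stop reading once done is true. Because partial lines stay in the buffer, an event split across two chunks is parsed only after its closing newline arrives.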

Real-Time Audio: Streaming Speech-to-Text and Text-to-Speech

Beyond text streaming, real-time audio capabilities are transforming how users interact with AI. Streaming speech-to-text (transcribing audio as it's spoken) and streaming text-to-speech (generating audio as text is produced) enable natural voice interfaces that feel conversational rather than transactional.

For speech-to-text, the pattern is straightforward: capture audio chunks from the user's microphone, send them to GreatRouter's ASR endpoint as they're recorded, and receive incremental transcription results. This enables live captioning in GreatChat's Meeting studio, where words appear on screen as they are spoken.

For text-to-speech, the pattern reverses: as the language model generates text tokens, each token (or small group of tokens) is sent to a TTS model that generates the corresponding audio. The audio plays progressively as the text is generated, creating the experience of the AI "speaking" in real time. This is especially powerful for voice assistants, accessibility features, and hands-free interfaces.

The technical challenge with streaming audio is latency management. Each hop in the chain (audio capture → encoding → network → ASR → text → LLM → text → TTS → network → decoding → playback) adds latency, and the delays accumulate. Low-latency models and efficient routing keep conversational interfaces responsive. GreatRouter's model selection considers latency as a ranking factor for streaming requests.

Multi-modal streaming, which combines text, audio, and potentially video in a single real-time session, is the frontier. Imagine a video call where AI generates real-time captions, suggests responses, and can generate images or documents on demand. This is the direction that products like GreatChat are heading, enabled by streaming AI infrastructure that handles every modality through a single routing layer.
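The text-to-speech pattern above sends "small groups of tokens" to the TTS model. One common way to form those groups is to cut at sentence boundaries, so each TTS request carries a speakable unit. The sketch below is an assumption about how that grouping could look, not GreatRouter's implementation: sentence_chunks and max_chars are hypothetical names, and the punctuation rule is deliberately naive (it will also cut after abbreviations or a token ending in a decimal point).

```python
from typing import Iterable, Iterator

# Characters treated as the end of a sentence (a simplifying assumption).
SENTENCE_END = (".", "!", "?")


def sentence_chunks(tokens: Iterable[str], max_chars: int = 200) -> Iterator[str]:
    """Group streamed tokens into sentence-sized pieces for TTS requests.

    A piece is emitted when the buffered text ends with sentence punctuation,
    or when it reaches max_chars so that long sentences do not delay audio.
    """
    buffer = ""
    for token in tokens:
        buffer += token
        piece = buffer.strip()
        if piece.endswith(SENTENCE_END) or len(buffer) >= max_chars:
            if piece:
                yield piece
            buffer = ""
    # Emit whatever is left when the model stops generating.
    if buffer.strip():
        yield buffer.strip()
```

Each yielded piece would be sent to the TTS model while the language model keeps generating, so audio for the first sentence can start playing before the second is finished. A smaller max_chars lowers the delay before the first audio at the cost of less natural prosody across cuts.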