Real-time inference · in private beta

The runtime for interactive video and world models.

Windflow is the inference platform for real-time video and world models — persistent session state, bidirectional streaming, sub-50 ms frame latency, delivered as a single API. The runtime is the boring part; we'd rather you spend your time on the application.

  • < 50 ms frame-latency target
  • Sustained interactive frame rate
  • Coherent frames per session
Helios · open world · live
LingBot · action-controlled world · live
CausVid · video restyle · live
01 — The frame-rate shift

A wind tunnel, not a camera.

The first generation of generative video gave us beautiful frozen frames. The second gave us pre-rendered clips — motion implied, but the loop fixed. Both were cameras: capture, render, ship.

What is arriving now is a different instrument entirely: models that respond mid-generation, hold state across thousands of frames, and answer a control signal within a sub-50 ms frame budget. The shift is not faster video; it is video that reacts.

We think of it like the difference between a photograph of an aircraft and the wind tunnel that proved it could fly. One records; the other lets the world push back. The runtime for that medium is what we're building.

The research is here

Real-time video models exist. Production runtimes for them do not.

The last eighteen months produced a wave of models that respond mid-generation and hold state across thousands of frames. They are research artifacts — not APIs. Windflow exists to make this class of model something developers can actually build on.

Google DeepMind

Genie 3

720p · 24 fps interactive worlds

Real-time world model with persistent state across minutes of interaction.

Streaming DiT, 2025

StreamDiT

16 fps text-to-video · single H100

Causal diffusion transformer that produces video as a continuous stream, not a clip.

Causal video diffusion, 2024

CausVid

9.4 fps streaming inference

Autoregressive video diffusion proving sub-second control-to-frame loops are tractable.

Live from the runtime

What sub-50 ms actually looks like.

A single uninterrupted session per clip: no pre-renders, no cherry-picked seeds, no offline upscaling.

Helios · world model · open world · 1280×720 · 24 fps · p50 latency 47 ms

An infinite open world streamed frame-by-frame from a short prompt. Terrain, foliage and lighting hold across thousands of frames because the session's KV cache lives on the GPU between requests.

  • Helios · open world
  • LingBot · action-controlled
  • LongLive · multi-shot
  • CausVid · low-latency diffusion
  • LTX-Video · real-time DiT
  • HunyuanVideo · long horizon
  • Sub-50 ms frame target
  • WebRTC + WebSocket transport
  • Per-second GPU billing
02 — Why existing infrastructure breaks

LLM serving stacks were not designed for this.

Inference platforms today are optimized for chat completions — fire a prompt, return a response, end the session. Real-time video models invert almost every assumption these systems were built on.

Axis                 | LLM inference                     | Batch diffusion               | Windflow
Workload shape       | Single prompt → single response   | Batch of discrete generations | Persistent bidirectional session
Latency target       | First token in ~300 ms            | Tens of seconds per image     | Sub-50 ms per frame, sustained
State model          | Stateless or short context window | Stateless                     | Consistent across 1000s of frames
Input channel        | Text prompt at start              | Prompt + seed                 | Continuous control: pose, action, audio
Optimization surface | KV cache, speculative decoding    | Throughput batching           | Custom CUDA + streaming schedulers
03 — The Windflow platform

Four layers, one streaming endpoint.

Each layer is exposed as a primitive. Use them together as a managed runtime, or pull just the one you need into an existing stack.

i. Layer

Streaming Runtime

A bidirectional WebSocket and WebRTC protocol that pipelines control inputs and generated frames in the same session. No polling, no awkward chunked HTTP, no batching latency tax.

  • WebRTC + WebSocket
  • < 50 ms RTT budget
  • Backpressure aware
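
One way to read "backpressure aware" for an interactive stream: when the consumer falls behind, stale frames are dropped so that the next frame delivered is a recent one. The sketch below illustrates that idea with a bounded queue. `EncodedFrame` and `LatestFrameQueue` are hypothetical names for illustration, and the drop-oldest policy is an assumption, not a description of the Windflow protocol.

```typescript
// Hypothetical sketch: a bounded frame queue that prefers fresh frames.
// None of these names come from the Windflow SDK.
interface EncodedFrame {
  seq: number;          // monotonically increasing frame index
  payload: Uint8Array;  // encoded frame bytes
}

class LatestFrameQueue {
  private readonly capacity: number;
  private buffered: EncodedFrame[] = [];
  private dropped = 0;

  constructor(capacity: number) {
    if (capacity < 1) throw new RangeError("capacity must be >= 1");
    this.capacity = capacity;
  }

  // Producer side: if the queue is full, evict the oldest frame so the
  // consumer never has to work through a backlog of stale frames.
  push(frame: EncodedFrame): void {
    if (this.buffered.length === this.capacity) {
      this.buffered.shift();
      this.dropped += 1;
    }
    this.buffered.push(frame);
  }

  // Consumer side: take the oldest frame still queued, if any.
  pop(): EncodedFrame | undefined {
    return this.buffered.shift();
  }

  get droppedCount(): number {
    return this.dropped;
  }
}

// A slow consumer: five frames arrive before the first read.
const frameQueue = new LatestFrameQueue(2);
for (let seq = 0; seq < 5; seq++) {
  frameQueue.push({ seq, payload: new Uint8Array(0) });
}
```

With a capacity of two, the consumer resumes at the two most recent frames and the three older ones are counted as dropped. A lossless transport would instead signal the producer to slow down; which policy fits depends on whether freshness or completeness matters more.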
ii. Layer

Session State

A persistence layer purpose-built for spatial and temporal coherence. Models hold a consistent world across thousands of frames; you can branch, fork, and replay sessions.

  • KV + latent caching
  • Branching & rewind
  • Deterministic replay
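
Branching and deterministic replay can both be built on one idea: keep the seed and the ordered control inputs, and rebuild state from them. The toy sketch below shows that shape. `SessionLog`, `ControlInput`, and the jittered 2D position standing in for model state are all hypothetical, chosen only to make the behaviour visible.

```typescript
// Hypothetical sketch: a toy session log with fork and deterministic replay.
// A 2D position stands in for real model state.
type ControlInput = { dx: number; dy: number };
type WorldPoint = { x: number; y: number };

// Small linear congruential generator so "randomness" is reproducible.
function seededRandom(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (state * 1664525 + 1013904223) % 4294967296;
    return state / 4294967296;
  };
}

class SessionLog {
  readonly seed: number;
  private readonly controls: ControlInput[];

  constructor(seed: number, controls: ControlInput[] = []) {
    this.seed = seed;
    this.controls = controls.slice(); // own copy, so forks never share history
  }

  record(control: ControlInput): void {
    this.controls.push(control);
  }

  get length(): number {
    return this.controls.length;
  }

  // Fork: a new, independent log containing the first `upTo` inputs.
  fork(upTo: number = this.controls.length): SessionLog {
    return new SessionLog(this.seed, this.controls.slice(0, upTo));
  }

  // Replay: rebuild every state from the seed and the recorded inputs.
  replay(): WorldPoint[] {
    const rand = seededRandom(this.seed);
    let pos: WorldPoint = { x: 0, y: 0 };
    return this.controls.map((c) => {
      const jitter = rand() - 0.5;
      pos = { x: pos.x + c.dx + jitter, y: pos.y + c.dy - jitter };
      return pos;
    });
  }
}

const originalLog = new SessionLog(42);
originalLog.record({ dx: 1, dy: 0 });
originalLog.record({ dx: 0, dy: 1 });
originalLog.record({ dx: -1, dy: 0 });

// Branch after the second input and steer somewhere else.
const branchLog = originalLog.fork(2);
branchLog.record({ dx: 5, dy: 5 });
```

Replaying `originalLog` twice yields identical states, and `branchLog` matches it for the first two steps before diverging. A real runtime would checkpoint cached state rather than recompute from the start, but the contract is the same: same seed and same inputs give the same frames.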
iii. Layer

Model Optimization

Research-grade open-source models, hardened for production. Custom CUDA kernels, torch.compile pipelines, quantization, and scheduler tuning make the difference between a demo and a product.

  • Custom CUDA kernels
  • FP8 / INT8 paths
  • Scheduler co-design
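
For readers new to the INT8 path, the core idea fits in a few lines: map floating-point weights onto 8-bit integers with a shared scale, and multiply back at use time. The sketch below is the textbook symmetric per-tensor scheme, written in TypeScript for illustration only. The function names are hypothetical, and production kernels would do this on the GPU, typically with finer-grained scales.

```typescript
// Hypothetical sketch: symmetric per-tensor INT8 quantization.
// Illustrative only; real kernels run on the GPU.
interface QuantizedTensor {
  values: Int8Array; // integers in [-127, 127]
  scale: number;     // one float scale shared by the whole tensor
}

function quantizeInt8(weights: Float32Array): QuantizedTensor {
  // The largest magnitude decides how much range each integer step covers.
  const maxAbs = weights.reduce((m, w) => Math.max(m, Math.abs(w)), 0);
  const scale = maxAbs === 0 ? 1 : maxAbs / 127;
  const values = Int8Array.from(weights, (w) =>
    Math.max(-127, Math.min(127, Math.round(w / scale)))
  );
  return { values, scale };
}

function dequantizeInt8(tensor: QuantizedTensor): Float32Array {
  return Float32Array.from(tensor.values, (q) => q * tensor.scale);
}

const sampleWeights = new Float32Array([0.5, -1.0, 0.25, 0]);
const quantized = quantizeInt8(sampleWeights);
const restored = dequantizeInt8(quantized);
```

Each restored weight differs from the original by at most about half a quantization step (`scale / 2`), which is the accuracy cost traded for a 4x smaller tensor than FP32.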
iv. Layer

Global GPU Mesh

Designed for a geographically distributed fleet of H100 / B200 class GPUs with session affinity. Users connect to the GPU region closest to them; sessions are designed to migrate live without dropping a frame.

  • Multi-region by design
  • Live session migration
  • Per-second billing
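
Two of the ideas above are easy to make concrete: routing a user to the nearest region while respecting session affinity, and billing by the second. The sketch below is one plausible shape for each. `RegionProbe`, `chooseRegion`, the 10 ms switching margin, and the round-up rule in `billedSeconds` are assumptions made for illustration, not published Windflow behaviour.

```typescript
// Hypothetical sketch: nearest-region routing with session affinity,
// plus per-second billing. Names and thresholds are illustrative.
interface RegionProbe {
  region: string;
  rttMs: number; // measured round-trip time from the client
}

// Prefer the lowest-RTT region, but keep an existing session where it is
// unless another region is better by at least `marginMs`. The margin
// avoids migrating back and forth on small RTT fluctuations.
function chooseRegion(
  probes: RegionProbe[],
  currentRegion?: string,
  marginMs: number = 10
): string {
  if (probes.length === 0) throw new Error("no regions probed");
  const best = probes.reduce((a, b) => (b.rttMs < a.rttMs ? b : a));
  const pinned = probes.find((p) => p.region === currentRegion);
  if (pinned && pinned.rttMs - best.rttMs < marginMs) return pinned.region;
  return best.region;
}

// Assumption: streamed time is rounded up to whole seconds.
function billedSeconds(streamedMs: number): number {
  return Math.ceil(Math.max(0, streamedMs) / 1000);
}

const probes: RegionProbe[] = [
  { region: "eu-west", rttMs: 18 },
  { region: "eu-central", rttMs: 12 },
  { region: "us-east", rttMs: 95 },
];

const freshSession = chooseRegion(probes);               // no affinity yet
const stickySession = chooseRegion(probes, "eu-west");   // 6 ms gain, stays put
const migratedSession = chooseRegion(probes, "us-east"); // 83 ms gain, moves
```

The region names are placeholders. The margin is the interesting design choice: set too low, sessions flap between regions; set too high, users stay pinned to a distant GPU.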
04 — Design targets

The targets the runtime is being built against.

We're pre-launch, so these are commitments, not retroactive marketing. They're the numbers the architecture, the kernels, and the SDK were designed around. When the platform ships, our public benchmarks will be published against this same list, in the open.

  • < 50 ms: per-frame latency target, region-local
  • Sustained interactive throughput at 720p, in fps
  • Coherent context window we're designing toward, in frames
  • Real open-source video / world models in the launch catalog
  • Per-second billing: pay only while a session is streaming, down to the second
  • Apache first: we prioritise permissive weights; Tencent / OpenRAIL where licensed
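
A per-frame latency target and a sustained frame rate are related but separate budgets, and the arithmetic is worth spelling out. The sketch below uses 24 fps, the rate shown in the Helios demo, as an illustrative value. The function names are hypothetical.

```typescript
// Hypothetical sketch: relating per-frame latency to sustained frame rate.

// Time between consecutive frames at a given frame rate.
function frameIntervalMs(fps: number): number {
  if (fps <= 0) throw new RangeError("fps must be positive");
  return 1000 / fps;
}

// Minimum number of frames that must be in flight at once to sustain
// `fps` when each frame takes `latencyMs` from control input to display.
function framesInFlight(latencyMs: number, fps: number): number {
  return Math.ceil(latencyMs / frameIntervalMs(fps));
}

const intervalAt24 = frameIntervalMs(24);   // about 41.7 ms between frames
const slotsAt50ms = framesInFlight(50, 24); // latency exceeds one interval
const slotsAt40ms = framesInFlight(40, 24); // latency fits inside one interval
```

At 24 fps a new frame is due roughly every 41.7 ms, so a frame that takes 50 ms end to end only sustains that rate if at least two frames are pipelined at once. Below about 41.7 ms, a single frame in flight is enough.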
05 — Developer experience

Open a session. Push a control signal. Receive frames.

The full SDK is a thin wrapper around our streaming protocol: explicit about what is happening, opinionated about what should be invisible. There is no GPU to choose, no model server to deploy, no retry logic to glue together.

  • First-class SDKs for TypeScript and Python; Rust and Swift in development
  • Zero-config WebRTC negotiation in the browser and on device
  • Session tokens scoped to model, region, and per-second budget
examples / interactive-session.ts
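
A minimal sketch of the open, push, receive flow described above. Everything here is a stand-in: `openSession`, `LoopbackSession`, `ControlSignal`, and `GeneratedFrame` are hypothetical names, and the "model" is a local loopback that emits one frame per control signal so the example runs without a network. It is not the Windflow SDK.

```typescript
// Hypothetical sketch of the open → push control → receive frames flow.
// The loopback "model" below stands in for a remote streaming session.
interface ControlSignal {
  action: string; // e.g. a movement or camera command
  t: number;      // client timestamp in ms
}

interface GeneratedFrame {
  index: number;        // position of the frame within the session
  inResponseTo: string; // the control action this frame reacted to
}

type FrameHandler = (frame: GeneratedFrame) => void;

class LoopbackSession {
  readonly model: string;
  private handlers: FrameHandler[] = [];
  private nextIndex = 0;
  private isOpen = true;

  constructor(model: string) {
    this.model = model;
  }

  // Register a callback invoked for every frame the session produces.
  onFrame(handler: FrameHandler): void {
    this.handlers.push(handler);
  }

  // A real transport would write to a WebSocket or WebRTC channel here.
  // The loopback synthesizes one frame per control signal instead.
  send(control: ControlSignal): void {
    if (!this.isOpen) throw new Error("session is closed");
    const frame: GeneratedFrame = {
      index: this.nextIndex++,
      inResponseTo: control.action,
    };
    this.handlers.forEach((handler) => handler(frame));
  }

  close(): void {
    this.isOpen = false;
  }
}

function openSession(options: { model: string }): LoopbackSession {
  return new LoopbackSession(options.model);
}

// Open a session. Push a control signal. Receive frames.
const session = openSession({ model: "helios" });
const received: GeneratedFrame[] = [];
session.onFrame((frame) => {
  received.push(frame);
});
session.send({ action: "move_forward", t: 0 });
session.send({ action: "turn_left", t: 33 });
session.close();
```

The point of the shape is that control input and frame output share one long-lived session object, rather than a request per frame.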
06 — Model catalog

Real open-source models, hardened for streaming.

Real-time long video

Helios

Interactive real-time video generation with infinite streaming. The Helios family is our target workload for game-studio integrations on Windflow.

Latency target: 47 ms
Action-controlled world model

LingBot

High-fidelity environments with action-conditioned, real-time interactive output. Open weights, optimized end-to-end on Windflow.

Latency target: 52 ms
Multi-shot video

LongLive

Multi-shot interactive video generation. Hold narrative state across scene cuts while the user keeps steering.

Latency target: 58 ms
Causal video diffusion

CausVid

Autoregressive video diffusion for low-latency text-to-video, frame-accurate restyle, background replacement, and virtual production.

Latency target: 38 ms
07 — Where we're aimed

Three categories of workload, one streaming runtime.

01 / Gaming

Worlds that generate themselves

Designed for studios prototyping titles where environments — terrain, weather, NPC behaviour — are generated frame-by-frame in response to player action. Windflow is the runtime we're building underneath.

02 / Simulation

Physical AI without the lab

A target workload for robotics and autonomy teams who want to train and evaluate policies inside generated worlds that respond to the same control signals as the physical fleet — in real time, in the loop.

03 / Creative tools

Editing video the way you edit text

For creative-software teams who want cursor-speed video transforms — restyle, replace, regenerate — behind their canvas, by streaming a Windflow session instead of operating a model server.

08 — The team

Built by people who have shipped real-time systems.

A small, senior team across streaming infrastructure, GPU systems, and developer platforms. Full bios go live with our public launch; in the meantime, the shape of the team we are assembling:

Systems · Streaming runtime

Previously built low-latency video infrastructure at hyperscale; deep WebRTC and CUDA background.

ML · Model optimization

Shipped diffusion and DiT inference work in production; custom CUDA kernels and torch.compile pipelines.

Developer experience

Built and operated developer platforms used by tens of thousands of engineers.

We're hiring across systems, ML, and developer experience.
hello@windflow.dev
Join the private beta

Real-time inference is becoming a primitive. Build like it.

We're opening the platform to a small number of design partners — teams building games, simulators, and creative tools where sub-50 ms is the difference between magic and unusable. Pricing will be per GPU-second, with no committed minimums.