Real-time inference · in private beta

The runtime for interactive video and world models.

Windflow is the inference platform for real-time video and world models — persistent session state, bidirectional streaming, sub-50 ms frame latency, delivered as a single API. The runtime is the boring part; we'd rather you spend your time on the application.

Request access · See the architecture

< 50 ms

Frame-latency target

Sustained interactive fps

Coherent frames / session

Helios · open world · live

LingBot · action-controlled world · live

CausVid · video restyle · live

01 — The frame-rate shift

A wind tunnel, not a camera.

The first generation of generative video gave us beautiful frozen frames. The second gave us pre-rendered clips — motion implied, but the loop fixed. Both were cameras: capture, render, ship.

What is arriving now is a different instrument entirely. Models that respond mid-generation, hold state across thousands of frames, and answer a control signal in less time than a monitor takes to refresh. The shift is not faster video — it is video that reacts.

We think of it like the difference between a photograph of an aircraft and the wind tunnel that proved it could fly. One records; the other lets the world push back. The runtime for that medium is what we're building.

The research is here

Real-time video models exist. Production runtimes for them do not.

The last eighteen months produced a wave of models that respond mid-generation and hold state across thousands of frames. They are research artifacts — not APIs. Windflow exists to make this class of model something developers can actually build on.

Google DeepMind

Genie 3

720p · 24 fps interactive worlds

Real-time world model with persistent state across minutes of interaction.

Streaming DiT, 2025

StreamDiT

16 fps text-to-video · single H100

Causal diffusion transformer that produces video as a continuous stream, not a clip.

Causal video diffusion, 2024

CausVid

9.4 fps streaming inference

Autoregressive video diffusion proving sub-second control-to-frame loops are tractable.

Live from the runtime

What sub-50 ms actually looks like.

A single uninterrupted session per clip: no pre-renders, no cherry-picked seeds, no offline upscaling.

Helios · world model · open world · clip 01 / 04

1280×720 · 24 fps · p50 latency 47 ms

An infinite open world streamed frame-by-frame from a short prompt. Terrain, foliage and lighting hold across thousands of frames because the session's KV cache lives on the GPU between requests.

Helios · open world
LingBot · action-controlled
LongLive · multi-shot
CausVid · low-latency diffusion
LTX-Video · real-time DiT
HunyuanVideo · long horizon
Sub-50 ms frame target
WebRTC + WebSocket transport
Per-second GPU billing

02 — Why existing infrastructure breaks

LLM serving stacks were not designed for this.

Inference platforms today are optimized for chat completions — fire a prompt, return a response, end the session. Real-time video models invert almost every assumption these systems were built on.

Axis	LLM inference	Batch diffusion	Windflow
Workload shape	Single prompt → single response	Batch of discrete generations	Persistent bidirectional session
Latency target	First-token in ~300 ms	Tens of seconds per image	Sub-50 ms per frame, sustained
State model	Stateless or short context window	Stateless	Consistent across 1000s of frames
Input channel	Text prompt at start	Prompt + seed	Continuous control: pose, action, audio
Optimization surface	KV cache, speculative decoding	Throughput batching	Custom CUDA + streaming schedulers

03 — The Windflow platform

Four layers, one streaming endpoint.

Each layer is exposed as a primitive. Use them together as a managed runtime, or pull just the one you need into an existing stack.

i. Layer

Streaming Runtime

A bidirectional WebSocket and WebRTC protocol that pipelines control inputs and generated frames in the same session. No polling, no awkward chunked HTTP, no batching latency tax.

WebRTC + WebSocket
< 50 ms RTT budget
Backpressure aware
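
The page does not define "backpressure aware". One common reading for real-time video is that a slow consumer should receive the newest frames, not a growing backlog of stale ones. The sketch below illustrates that idea with a bounded queue that evicts the oldest frame when full. `Frame`, `LatestFrameQueue`, and `makeFrame` are hypothetical names for illustration, not part of the Windflow protocol or SDK.

```typescript
// Hypothetical illustration: none of these names come from the Windflow SDK.
interface Frame {
  seq: number;         // monotonically increasing frame index
  payload: Uint8Array; // encoded frame bytes
}

class LatestFrameQueue {
  private frames: Frame[] = [];
  private droppedCount = 0;
  private readonly capacity: number;

  constructor(capacity: number) {
    if (capacity < 1) throw new Error("capacity must be >= 1");
    this.capacity = capacity;
  }

  // Producer side: enqueue a frame, evicting the oldest one if the consumer is behind.
  push(frame: Frame): void {
    if (this.frames.length === this.capacity) {
      this.frames.shift();
      this.droppedCount += 1;
    }
    this.frames.push(frame);
  }

  // Consumer side: take the oldest frame still queued, or undefined if the queue is empty.
  pop(): Frame | undefined {
    return this.frames.shift();
  }

  // Number of frames evicted so far; a sender could use this to lower its frame rate.
  get dropped(): number {
    return this.droppedCount;
  }

  get depth(): number {
    return this.frames.length;
  }
}

// Helper for building empty test frames.
function makeFrame(seq: number): Frame {
  return { seq: seq, payload: new Uint8Array(0) };
}
```

With a capacity of two, pushing frames 0, 1, and 2 without reading leaves frames 1 and 2 queued and reports one drop. Bounding the queue keeps added latency to at most `capacity` frame intervals, at the cost of skipped frames.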

ii. Layer

Session State

A persistence layer purpose-built for spatial and temporal coherence. Models hold a consistent world across thousands of frames; you can branch, fork, and replay sessions.

KV + latent caching
Branching & rewind
Deterministic replay
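
Branching and deterministic replay can be pictured as a seed plus an append-only log of control inputs: replaying the same log gives the same result, and a fork is a prefix of the log with new inputs appended. The sketch below is a toy model of that idea only. `SessionLog`, `Control`, and `replay` are hypothetical names, and the "model" is a simple rolling hash standing in for real frame generation; it says nothing about how Windflow stores KV or latent caches.

```typescript
// Hypothetical illustration: a toy stand-in for a model, not Windflow's session layer.
interface Control {
  frame: number;  // frame index at which the input was applied
  action: string; // opaque control payload, e.g. a key press
}

class SessionLog {
  readonly seed: number;
  readonly controls: Control[];

  constructor(seed: number, controls: Control[] = []) {
    this.seed = seed;
    this.controls = controls;
  }

  // Appending returns a new log, so earlier snapshots stay valid for rewind.
  append(control: Control): SessionLog {
    return new SessionLog(this.seed, this.controls.concat([control]));
  }

  // Fork at a frame: keep only the controls issued before that frame.
  forkAt(frame: number): SessionLog {
    return new SessionLog(this.seed, this.controls.filter((c) => c.frame < frame));
  }
}

// Toy "model": folds the seed and every control into a 32-bit state.
// The same seed and the same log always produce the same state.
function replay(log: SessionLog): number {
  let state = log.seed >>> 0;
  for (const c of log.controls) {
    const text = c.frame + ":" + c.action;
    for (let i = 0; i < text.length; i++) {
      state = (state * 31 + text.charCodeAt(i)) >>> 0;
    }
  }
  return state;
}
```

For example, `base.forkAt(10).append({ frame: 10, action: "s" })` creates a branch that shares everything before frame 10 with `base` and diverges from there, while `base` itself is left unchanged.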

iii. Layer

Model Optimization

Research-grade open-source models, rewritten for production. Custom CUDA kernels, torch.compile pipelines, quantization, and scheduler tricks that make the difference between a demo and a product.

Custom CUDA kernels
FP8 / INT8 paths
Scheduler co-design
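
As a rough picture of what an INT8 path involves, the sketch below shows generic per-tensor symmetric quantization: weights are scaled so the largest magnitude maps to 127, rounded to 8-bit integers, and multiplied back by the scale when read. This is the textbook scheme, written in TypeScript for readability; it is an assumption for illustration and not a description of Windflow's kernels. `QuantizedTensor`, `quantizeInt8`, and `dequantizeInt8` are hypothetical names.

```typescript
// Generic per-tensor symmetric INT8 quantization; not Windflow's actual kernel path.
interface QuantizedTensor {
  values: Int8Array; // quantized weights in [-127, 127]
  scale: number;     // multiply a quantized value by this to recover the float
}

function quantizeInt8(weights: Float32Array): QuantizedTensor {
  // Find the largest magnitude so it can be mapped to 127.
  let maxAbs = 0;
  for (let i = 0; i < weights.length; i++) {
    maxAbs = Math.max(maxAbs, Math.abs(weights[i]));
  }
  // Guard against an all-zero tensor, where any scale works.
  const scale = maxAbs === 0 ? 1 : maxAbs / 127;

  const values = new Int8Array(weights.length);
  for (let i = 0; i < weights.length; i++) {
    const q = Math.round(weights[i] / scale);
    values[i] = Math.max(-127, Math.min(127, q)); // clamp to the symmetric range
  }
  return { values: values, scale: scale };
}

function dequantizeInt8(q: QuantizedTensor): Float32Array {
  const out = new Float32Array(q.values.length);
  for (let i = 0; i < out.length; i++) {
    out[i] = q.values[i] * q.scale;
  }
  return out;
}
```

Each weight is stored in one byte instead of four, and the round-trip error per element is bounded by roughly half a quantization step (`scale / 2`). Production paths typically use finer-grained scales (per channel or per block) to tighten that bound.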

iv. Layer

Global GPU Mesh

Designed for a geographically distributed fleet of H100 / B200 class GPUs with session affinity. Users connect to the GPU closest to them; sessions migrate live without dropping a frame.

Multi-region by design
Live session migration
Per-second billing
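
The routing half of this layer, connecting to the closest GPU while keeping session affinity, can be sketched as a small selection rule: pick the healthy region with the lowest round-trip time, but stay on the current region unless another one is better by a clear margin. The sketch assumes the client has already measured RTTs. `RegionProbe`, `pickRegion`, the region names, and the 15 ms margin are hypothetical, and the sketch does not model the live state transfer itself.

```typescript
// Hypothetical illustration: region names and thresholds are made up.
interface RegionProbe {
  region: string;
  rttMs: number;    // measured round-trip time from the client
  healthy: boolean; // whether the region can accept sessions
}

function pickRegion(
  probes: RegionProbe[],
  currentRegion: string | null = null,
  migrateMarginMs: number = 15
): string | null {
  const healthy = probes.filter((p) => p.healthy);
  if (healthy.length === 0) return null;

  // Lowest-latency healthy region.
  let best = healthy[0];
  for (const p of healthy) {
    if (p.rttMs < best.rttMs) best = p;
  }

  // Session affinity: stay put unless another region wins by a clear margin.
  const current = healthy.filter((p) => p.region === currentRegion)[0];
  if (current !== undefined && current.rttMs - best.rttMs < migrateMarginMs) {
    return current.region;
  }
  return best.region;
}
```

The margin is there to avoid flapping: without it, two regions with nearly equal latency would trade a session back and forth on every probe.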

04 — Design targets

The targets the runtime is being built against.

We're pre-launch, so these are commitments, not retroactive marketing. They're the numbers the architecture, the kernels, and the SDK were designed around. When the platform ships, our public benchmarks will be published against this same list, in the open.

< 50 ms

Per-frame latency target, region-local

fps

Sustained interactive throughput at 720p

frames

Coherent context window we're designing toward

models

Real open-source video / world models in the launch catalog

Per-sec billing

Pay only while a session is streaming, down to the second

Apache first

We prioritise permissive weights; Tencent / OpenRAIL where licensed

05 — Developer experience

Open a session. Push a control signal. Receive frames.

The full SDK is a thin wrapper around our streaming protocol — explicit about what is happening, opinionated about what should be invisible. There is no GPU to choose, no model server to deploy, no retry logic to glue.

First-class SDKs for TypeScript and Python; Rust and Swift in development
Zero-config WebRTC negotiation in the browser and on device
Session tokens scoped to model, region, and per-second budget

examples / interactive-session.ts
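
A minimal sketch of that open, push, receive loop follows. Every name in it is an assumption made for illustration: `SessionToken`, `InteractiveSession`, `openMockSession`, and `runDemo` are hypothetical, and the session is an in-memory stand-in so the example runs without a network or GPU. A real streaming client would be asynchronous; the sketch is synchronous to keep the control flow easy to follow.

```typescript
// Every name below is hypothetical; the published Windflow SDK may look different.
interface SessionToken {
  model: string;         // model the token is scoped to
  region: string;        // region the token is scoped to
  budgetSeconds: number; // per-second budget the session may consume
}

interface ControlSignal {
  action: string;
}

interface GeneratedFrame {
  index: number;
  respondsTo: string; // the most recent control signal seen when the frame was made
}

interface InteractiveSession {
  pushControl(signal: ControlSignal): void;
  nextFrame(): GeneratedFrame;
  close(): void;
}

// In-memory stand-in for a streaming session: it echoes the latest control signal.
function openMockSession(token: SessionToken, fps: number): InteractiveSession {
  let index = 0;
  let latest = "idle";
  let open = true;
  const maxFrames = token.budgetSeconds * fps;

  return {
    pushControl(signal: ControlSignal): void {
      if (!open) throw new Error("session closed");
      latest = signal.action;
    },
    nextFrame(): GeneratedFrame {
      if (!open) throw new Error("session closed");
      if (index >= maxFrames) throw new Error("per-second budget exhausted");
      const frame = { index: index, respondsTo: latest };
      index += 1;
      return frame;
    },
    close(): void {
      open = false;
    },
  };
}

// Open a session, push a control signal, receive frames.
function runDemo(): GeneratedFrame[] {
  const token: SessionToken = { model: "helios", region: "local", budgetSeconds: 1 };
  const session = openMockSession(token, 24);
  const frames: GeneratedFrame[] = [];
  frames.push(session.nextFrame());             // frame before any input
  session.pushControl({ action: "turn-left" }); // steer mid-session
  frames.push(session.nextFrame());             // frame reflects the new input
  session.close();
  return frames;
}
```

The token carries the three scopes listed above (model, region, per-second budget), and the mock enforces the budget by refusing frames once `budgetSeconds * fps` frames have been produced.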

06 — Model catalog

Real open-source models, hardened for streaming.

Browse full catalog →

Real-time long video

Helios

Interactive real-time video generation with infinite streaming. The Helios family is our target workload for game-studio integrations on Windflow.

Latency target: 47 ms

Action-controlled world model

LingBot

High-fidelity environments with action-conditioned, real-time interactive output. Open weights, optimized end-to-end on Windflow.

Latency target: 52 ms

Multi-shot video

LongLive

Multi-shot interactive video generation. Hold narrative state across scene cuts while the user keeps steering.

Latency target: 58 ms

Causal video diffusion

CausVid

Autoregressive video diffusion — low-latency text-to-video and frame-accurate restyle, background replace, virtual production.

Latency target: 38 ms

07 — Where we're aimed

Three categories of workload, one streaming runtime.

01 / Gaming

Worlds that generate themselves

Designed for studios prototyping titles where environments — terrain, weather, NPC behaviour — are generated frame-by-frame in response to player action. Windflow is the runtime we're building underneath.

02 / Simulation

Physical AI without the lab

A target workload for robotics and autonomy teams who want to train and evaluate policies inside generated worlds that respond to the same control signals as the physical fleet — in real time, in the loop.

03 / Creative tools

Editing video the way you edit text

For creative-software teams who want cursor-speed video transforms — restyle, replace, regenerate — behind their canvas, by streaming a Windflow session instead of operating a model server.

08 — The team

Built by people who have shipped real-time systems.

A small, senior team across streaming infrastructure, GPU systems, and developer platforms. Full bios go live with our public launch; in the meantime, the shape of the team we are assembling:

Systems · Streaming runtime

Previously built low-latency video infrastructure at hyperscale; deep WebRTC and CUDA background.

ML · Model optimization

Shipped diffusion and DiT inference work in production; custom CUDA kernels and torch.compile pipelines.

Developer experience

Built and operated developer platforms used by tens of thousands of engineers.

We're hiring across systems, ML, and developer experience.

hello@windflow.dev

Join the private beta

Real-time inference is becoming a primitive. Build like it.

We're opening the platform to a small number of design partners — teams building games, simulators, and creative tools where sub-50 ms is the difference between magic and unusable. Pricing will be per GPU-second, with no committed minimums.

Request access · Talk to engineering