Platform

Four layers. One streaming endpoint.

The Windflow runtime is built bottom-up for one workload: every layer assumes you are running a persistent, bidirectional, frame-paced session. Use the stack end-to-end, or pull individual primitives into your existing infrastructure.

Layer i.

Streaming Runtime

A bidirectional transport designed around frames, not requests. Control input goes up the wire while generated frames come down, over the same socket, at the same time, with backpressure.

Protocols: WebRTC, WebSocket, gRPC-Web
Frame budget: < 50 ms end-to-end, region-local
Concurrency: 1,000+ sessions per region
Transport: QUIC, tuned for lossy mobile networks
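
To make "with backpressure" concrete, here is a minimal sketch of a duplex, frame-paced channel built on Python's asyncio. It is not the Windflow API: the function names, the in-process queues standing in for a socket, and the max_in_flight limit are all assumptions made for illustration. The bounded downstream queue is what applies backpressure: the generator pauses whenever the client falls behind.

```python
import asyncio


async def generate_frames(down: asyncio.Queue, up: asyncio.Queue, total: int) -> list:
    """Server side (hypothetical): emit frames down, apply control input from up."""
    applied = []
    for frame_id in range(total):
        # Drain any control input that arrived since the last frame.
        while not up.empty():
            applied.append(up.get_nowait())
        # put() suspends while the queue is full: this is the backpressure.
        await down.put(frame_id)
    await down.put(None)  # sentinel marking end of stream
    return applied


async def play_frames(down: asyncio.Queue, up: asyncio.Queue, stats: dict) -> list:
    """Client side (hypothetical): consume frames and send control input back."""
    received = []
    while True:
        stats["peak_depth"] = max(stats["peak_depth"], down.qsize())
        frame_id = await down.get()
        if frame_id is None:
            return received
        received.append(frame_id)
        up.put_nowait(("input", frame_id))  # control travels the other way
        await asyncio.sleep(0)  # yield so the generator can run


async def _run(total: int, max_in_flight: int):
    down = asyncio.Queue(maxsize=max_in_flight)  # bounded: frames in flight
    up = asyncio.Queue()                         # unbounded: control input
    stats = {"peak_depth": 0}
    applied, received = await asyncio.gather(
        generate_frames(down, up, total),
        play_frames(down, up, stats),
    )
    return received, applied, stats["peak_depth"]


def run_session(total: int = 20, max_in_flight: int = 4):
    """Run one simulated session and return (frames, controls applied, peak depth)."""
    return asyncio.run(_run(total, max_in_flight))
```

Calling run_session(20, 4) delivers all 20 frames in order while never holding more than 4 in flight. A real transport would enforce the same limit with flow-control windows rather than an in-memory queue.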

Layer ii.

Session State

A persistence layer that keeps your model coherent across thousands of frames. Sessions are first-class objects: branchable, replayable, migratable across regions.

Context: Up to 1,840 frames per session
Branching: Fork mid-session, replay deterministically
Storage: Latent + KV cache, tiered SSD / NVMe
Migration: Live region failover, no dropped frames
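
One way to picture a branchable, deterministically replayable session is as a seed plus an ordered log of control inputs, with state derived by a pure step function. The sketch below illustrates that idea only. The Session class, its methods, and the hash-based step function are hypothetical stand-ins, not how Windflow stores latents or KV cache.

```python
import hashlib
from dataclasses import dataclass, field


def step(state: bytes, control: str) -> bytes:
    """Stand-in for one model step: a pure function of (state, control)."""
    return hashlib.sha256(state + control.encode("utf-8")).digest()


@dataclass
class Session:
    seed: bytes
    log: list = field(default_factory=list)  # one control input per frame

    def advance(self, control: str) -> None:
        """Record the control input for the next frame."""
        self.log.append(control)

    def state(self) -> bytes:
        """Replay the log from the seed; same seed and log give the same state."""
        current = self.seed
        for control in self.log:
            current = step(current, control)
        return current

    def fork(self, at_frame=None) -> "Session":
        """Branch at a frame index (default: now) with an independent copy of the log."""
        cut = len(self.log) if at_frame is None else at_frame
        return Session(self.seed, list(self.log[:cut]))
```

Because state is a function of the seed and the log alone, a fork starts out identical to its parent and diverges only once the two receive different inputs. Replaying a saved log reproduces the original state exactly.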

Layer iii.

Model Optimization

Open-source research weights, made production-ready. Custom CUDA kernels, torch.compile pipelines, FP8 quantization, and step-skipping schedulers — the unglamorous work that turns a 2 fps demo into a 24 fps product.

Kernels: Custom CUDA for attention + UNet blocks
Quantization: FP8, INT8 weight-only, mixed precision
Schedulers: Step-skipping with quality guardrails
Compilation: torch.compile + TensorRT export
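
Going from 2 fps to 24 fps is a 12x throughput gain: the time available per frame drops from 500 ms to roughly 41.7 ms. Part of that can come from running fewer sampler steps. The sketch below shows one simple form of step skipping, under stated assumptions: steps are kept at evenly spaced indices, and the "quality guardrail" is modeled as a floor on how few steps may run. The source does not say how Windflow's schedulers choose steps, so skipping_schedule and min_keep are illustrative names only.

```python
def frame_interval_ms(fps: float) -> float:
    """Time available per frame at a given frame rate."""
    return 1000.0 / fps


def skipping_schedule(total_steps: int, keep: int, min_keep: int = 4) -> list[int]:
    """Pick evenly spaced step indices, always including the first and last.

    min_keep is an assumed guardrail: never run fewer than this many steps
    (or fewer than total_steps, if the full schedule is shorter than that).
    """
    if total_steps < 2:
        raise ValueError("need at least two steps")
    keep = max(keep, min_keep)    # guardrail: floor on steps kept
    keep = min(keep, total_steps) # cannot keep more steps than exist
    keep = max(keep, 2)           # first and last are always kept
    last = total_steps - 1
    # Integer arithmetic keeps indices distinct and strictly increasing.
    return [(i * last) // (keep - 1) for i in range(keep)]
```

For a 50-step schedule, skipping_schedule(50, 10) keeps 10 steps. Asking for only 2 is raised to the floor of 4. A production guardrail would more likely compare output quality against the full schedule than count steps.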

Layer iv.

Global GPU Mesh

A geographically distributed fleet of H100 and B200 GPUs with session affinity. Users connect to their nearest region; sessions move between regions live when capacity or latency demands it.

Regions: 12 active, 4 more in 2026
Hardware: H100 80GB and B200 180GB pools
Routing: Latency-aware, capacity-aware
Billing: Per GPU-second, no minimums
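
Latency-aware, capacity-aware routing can be sketched as: drop regions with no free capacity, then take the one with the lowest round-trip time. The Region fields, the pick_region function, and the region names used with it are hypothetical. The actual routing policy is not described here and may weigh more signals.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Region:
    name: str
    rtt_ms: float    # round-trip time measured from the user
    free_slots: int  # sessions the region can still accept


def pick_region(regions: list, max_rtt_ms: Optional[float] = None) -> Optional[Region]:
    """Return the lowest-latency region that still has capacity, or None."""
    candidates = [r for r in regions if r.free_slots > 0]
    if max_rtt_ms is not None:
        candidates = [r for r in candidates if r.rtt_ms <= max_rtt_ms]
    if not candidates:
        return None
    # Lowest latency wins; more free capacity breaks ties.
    return min(candidates, key=lambda r: (r.rtt_ms, -r.free_slots))
```

If the nearest region is full, the session lands in the next-nearest region that has room. Re-running the same selection as latency and capacity change is one plausible trigger for moving a live session.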