Four layers. One streaming endpoint.
The Windflow runtime is built from the ground up for interactive streaming: every layer assumes you are running a persistent, bidirectional, frame-paced session. Use the stack end-to-end, or pull individual primitives into your existing infrastructure.
Streaming Runtime
A bidirectional transport designed around frames, not requests. Control input goes up the wire while generated frames come down, over the same socket, at the same time, with backpressure.
- Protocols: WebRTC, WebSocket, gRPC-Web
- Frame budget: < 50 ms end-to-end, region-local
- Concurrency: 1,000+ sessions per region
- Transport: QUIC-tuned for lossy mobile networks
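To make "bidirectional with backpressure" concrete, here is a minimal sketch that simulates one session in-process with asyncio. It is not the Windflow API: `run_session`, the queue names, and the `max_in_flight` parameter are hypothetical, and two in-memory queues stand in for the two directions of a real socket.

```python
import asyncio


async def run_session(num_frames: int = 5, max_in_flight: int = 2) -> list[str]:
    """Simulate one session: controls flow up while frames flow down."""
    controls = asyncio.Queue()                      # client -> server, unbounded
    frames = asyncio.Queue(maxsize=max_in_flight)   # server -> client, bounded
    received: list[str] = []

    async def send_controls() -> None:
        for i in range(num_frames):
            await controls.put(f"input-{i}")
            await asyncio.sleep(0)                  # yield so the other tasks interleave

    async def generate_frames() -> None:
        for i in range(num_frames):
            control = await controls.get()
            # put() suspends while the queue is full, so a slow client
            # slows the generator down instead of piling up frames.
            await frames.put(f"frame {i} for {control}")
        await frames.put(None)                      # end-of-stream marker

    async def receive_frames() -> None:
        while (frame := await frames.get()) is not None:
            received.append(frame)

    await asyncio.gather(send_controls(), generate_frames(), receive_frames())
    return received
```

Calling `asyncio.run(run_session())` returns the frames in the order they were generated. The bounded queue is the whole backpressure mechanism in this sketch; a real transport would also need to decide whether to drop or coalesce stale control inputs.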
Session State
A persistence layer that keeps your model coherent across thousands of frames. Sessions are first-class objects: branchable, replayable, migratable across regions.
- Context: Up to 1,840 frames per session
- Branching: Fork mid-session, replay deterministically
- Storage: Latent + KV cache, tiered SSD / NVMe
- Migration: Live region failover, no dropped frames
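One way to picture forking and deterministic replay is a session that stores only a seed and its control-input log, with every frame derived from those two things. The sketch below assumes exactly that; the `Session` class and its methods are hypothetical, and a SHA-256 digest stands in for a real generated frame.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Session:
    seed: int
    inputs: list[str] = field(default_factory=list)

    def step(self, control: str) -> str:
        """Record a control input and return the frame it produces."""
        self.inputs.append(control)
        return self.frame_at(len(self.inputs) - 1)

    def frame_at(self, index: int) -> str:
        # A frame depends only on the seed and the inputs up to `index`,
        # which is what makes replay deterministic.
        payload = f"{self.seed}|" + "|".join(self.inputs[: index + 1])
        return hashlib.sha256(payload.encode()).hexdigest()[:12]

    def fork(self) -> "Session":
        """Branch mid-session: the copy shares history but not future inputs."""
        return Session(self.seed, list(self.inputs))

    def replay(self) -> list[str]:
        """Regenerate every frame from the recorded inputs."""
        return [self.frame_at(i) for i in range(len(self.inputs))]
```

After a fork, both branches replay an identical prefix and diverge only from the first input they do not share. A production system would presumably snapshot latents and KV cache rather than recompute from the start, which this sketch does not model.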
Model Optimization
Open-source research weights, made production-ready. Custom CUDA kernels, torch.compile pipelines, FP8 quantization, and step-skipping schedulers — the unglamorous work that turns a 2 fps demo into a 24 fps product.
- Kernels: Custom CUDA for attention + UNet blocks
- Quantization: FP8, INT8 weight-only, mixed precision
- Schedulers: Step-skipping with quality guardrails
- Compilation: torch.compile + TensorRT export
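As an illustration of step-skipping with a guardrail, the sketch below reuses the previous output when the last two outputs barely changed, and caps how many steps in a row may be skipped. This is an assumed heuristic for illustration, not Windflow's scheduler; `run_with_step_skipping`, `tolerance`, and `max_consecutive_skips` are hypothetical names, and `model` is any callable standing in for one denoising step.

```python
from typing import Callable


def run_with_step_skipping(
    model: Callable[[int], float],
    num_steps: int,
    tolerance: float = 0.05,
    max_consecutive_skips: int = 1,
) -> tuple[list[float], int]:
    """Return (per-step outputs, number of real model evaluations)."""
    outputs: list[float] = []
    evaluated = 0
    skips_in_a_row = 0
    for step in range(num_steps):
        can_skip = (
            len(outputs) >= 2
            and abs(outputs[-1] - outputs[-2]) < tolerance   # output has settled
            and skips_in_a_row < max_consecutive_skips       # quality guardrail
        )
        if can_skip:
            outputs.append(outputs[-1])   # reuse the cached result
            skips_in_a_row += 1
        else:
            outputs.append(model(step))   # pay for a real evaluation
            evaluated += 1
            skips_in_a_row = 0
    return outputs, evaluated
```

With a toy model such as `lambda step: 0.5 ** step`, later steps change little and some are skipped, while the cap forces a fresh evaluation after each skip. Real guardrails would compare image-quality metrics rather than a scalar difference.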
Global GPU Mesh
A geographically distributed fleet of H100 and B200 GPUs with session affinity. Users connect to their nearest region; sessions move between regions live when capacity or latency demands it.
- Regions: 12 active, 4 more in 2026
- Hardware: H100 80GB and B200 180GB pools
- Routing: Latency-aware, capacity-aware
- Billing: Per GPU-second, no minimums
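Latency-aware, capacity-aware routing and per-GPU-second billing can be sketched in a few lines. Everything here is illustrative: the `Region` fields, the region names, and the rate passed to `session_cost` are made-up values, and the policy (nearest region that still has free GPUs) is an assumption, not the documented routing algorithm.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Region:
    name: str
    latency_ms: float   # measured from the connecting user
    free_gpus: int


def pick_region(regions: list[Region], gpus_needed: int = 1) -> Optional[Region]:
    """Choose the lowest-latency region that still has enough free GPUs."""
    candidates = [r for r in regions if r.free_gpus >= gpus_needed]
    if not candidates:
        return None
    return min(candidates, key=lambda r: r.latency_ms)


def session_cost(gpu_count: int, seconds: float, rate_per_gpu_second: float) -> float:
    """Per GPU-second billing with no minimum charge."""
    return gpu_count * seconds * rate_per_gpu_second


# Hypothetical fleet snapshot: the nearest region is full, so it is skipped.
regions = [
    Region("region-a", latency_ms=12.0, free_gpus=0),
    Region("region-b", latency_ms=19.0, free_gpus=3),
    Region("region-c", latency_ms=48.0, free_gpus=8),
]
chosen = pick_region(regions)
```

In this snapshot `chosen` is `region-b`. The same function could be re-run during a session to decide when a live migration is worthwhile.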