How we got attention down to 0.9 ms per frame on H100
A walk through the kernel rewrite that pulled six milliseconds out of every frame on the Helios hot path, and the eval harness we used to make sure quality did not move.
L. Verma · Kernel Engineering