Fearless Concurrency on the GPU: Safe GPU inference in Rust, competitive with vLLM/SGLang [R]

I maintain cuTile Rust and just posted the paper "Fearless Concurrency on the GPU."

As more GPU code gets AI-generated, the bottleneck moves from writing it to trusting it. cuTile Rust lets you write or generate GPU kernels whose memory safety and data-race freedom are checked at compile time by Rust's ownership and borrow rules, so the guarantees hold by construction. It's a tile-based programming model that lowers to CUDA Tile IR and carries Rust's ownership model across the launch boundary: you partition a mutable output into disjoint mutable sub-tensors, pass inputs as shared references, and write tile kernels with single-threaded semantics that the compiler maps to thread blocks.
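To make the ownership pattern concrete, here is a CPU-only analogy using nothing but the Rust standard library. It is not cuTile Rust code and does not use its API: scoped threads stand in for thread blocks, and the function name `add_tiled` and the `tile` parameter are made up for illustration. The point is just the shape of the borrow: the output is split into non-overlapping `&mut` tiles, and the inputs are shared `&` slices.

```rust
use std::thread;

/// CPU analogy only (hypothetical helper, not the cuTile Rust API):
/// computes `out[i] = a[i] + b[i]`, with one scoped thread per output tile.
fn add_tiled(a: &[f32], b: &[f32], out: &mut [f32], tile: usize) {
    assert!(tile > 0, "tile length must be positive");
    assert!(a.len() == out.len() && b.len() == out.len(), "length mismatch");

    thread::scope(|s| {
        // `chunks_mut` splits the output into non-overlapping `&mut [f32]` tiles,
        // so no two workers can hold the same output element.
        for (idx, out_tile) in out.chunks_mut(tile).enumerate() {
            let start = idx * tile;
            // Inputs are borrowed immutably, so every worker may read them.
            let a_tile = &a[start..start + out_tile.len()];
            let b_tile = &b[start..start + out_tile.len()];
            s.spawn(move || {
                // The tile body reads as plain single-threaded code.
                for ((o, x), y) in out_tile.iter_mut().zip(a_tile).zip(b_tile) {
                    *o = x + y;
                }
            });
        }
    });
}
```

If you try to hand the same `out_tile` to two workers, or write through `a_tile`, the borrow checker rejects the program. That is the CPU-side version of the guarantee described above.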

To test this end to end, we built Grout, a Qwen3 inference engine, on cuTile Rust together with Hugging Face. At batch-1 decode it reaches 171 tok/s for Qwen3-4B on an RTX 5090 and 82 tok/s for Qwen3-32B on a B200, competitive with vLLM and SGLang. Batch-1 decode is memory-bandwidth-bound, and Grout's throughput is consistent with our HBM roofline analysis.
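For readers who want a feel for what "memory-bandwidth-bound" implies, here is a back-of-the-envelope sketch. It is not the roofline analysis from the paper. It assumes each decoded token streams every weight from device memory exactly once, and it ignores KV-cache and activation traffic. The function names are hypothetical, and the inputs suggested below are my assumptions, not figures from the post.

```rust
/// Rough upper bound on batch-1 decode throughput (tokens/s), assuming each
/// token reads all weights once and nothing else touches memory.
/// Hypothetical helper for illustration; not taken from Grout or the paper.
fn decode_roofline_tok_per_s(
    params: f64,                // number of model parameters (assumed)
    bytes_per_param: f64,       // e.g. 2.0 for 16-bit weights (assumed)
    bandwidth_bytes_per_s: f64, // nominal device memory bandwidth (assumed)
) -> f64 {
    bandwidth_bytes_per_s / (params * bytes_per_param)
}

/// Fraction of that bound achieved by a measured throughput.
fn fraction_of_roofline(measured_tok_per_s: f64, bound_tok_per_s: f64) -> f64 {
    measured_tok_per_s / bound_tok_per_s
}
```

As an illustration, assume round parameter counts (4e9 and 32e9), 16-bit weights, and nominal bandwidths of 1.792 TB/s for the RTX 5090 and 8 TB/s for the B200. The bound then comes out to 224 tok/s and 125 tok/s, which puts the reported 171 and 82 tok/s at roughly 76% and 66% of this simplified ceiling. The real ceiling is lower once KV-cache reads are counted, so treat these as loose upper bounds only.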

Many of Grout's kernels still use the unsafe path today, but they can be migrated to safe variants, which gives generated kernels a verifiable target. We've started a collection of such kernels in the cutile-kernels crate in the repo. If this is your thing, contributing safe variants helps grow a library of safe, high-performance kernels that future kernel synthesis can draw from.

On the kernel side, the safety is effectively free. On a B200 the safe GEMM is within 0.3% of a hand-written low-level version (~92% of dense f16 peak), and the element-wise kernel reaches ~7 TB/s, matching cuTile Python within measurement noise.

A few more caveats: Grout is batch-1 only and supports a small set of models (it's a research case study, not a drop-in server), it's NVIDIA-only (it lowers to Tile IR), and GEMM still slightly trails cuBLAS at some sizes.

- Paper: https://arxiv.org/abs/2606.15991
- Code: https://github.com/nvlabs/cutile-rs
- Grout: https://github.com/huggingface/grout

Hope you enjoy the paper and learn something new! Happy to answer any questions :)

submitted by /u/Exciting_Suspect9088