Tag

#performance

1 article tagged with "performance"

#inference (4) · #model-hosting (4) · #serverless-gpu (3) · #cuda (2) · #grace-hopper (2) · #gpu (2) · #pricing (2) · #gpu-cloud (2) · #reinforcement-learning (1) · #fine-tuning (1) · #visual-generation (1) · #pipeline (1) · #mamba (1) · #qwen3.5 (1) · #linear-attention (1)

February 17, 2026 · 6 min read

Ionattention: Grace Hopper–Native Inference

How Cumulus built the fastest GPU inference runtime for NVIDIA GH200 — 7,167 tok/s on a single chip. Coherent CUDA graphs, eager KV writeback, and phantom-tile attention scheduling.

inference · gpu · grace-hopper · +6
