Day-0 Support for the Qwen3.5 Family
What we found inside Qwen3.5's hybrid Mamba-Transformer weights and what it took to make the gated delta rule fast on GH200 — from matrix-valued recurrences to mixed batch state corruption.
When Qwen3.5 dropped, we opened the config and found something unexpected:
"layer_types": ["linear_attention", "linear_attention", "linear_attention", "full_attention", ...]64 layers. 48 labeled "linear_attention." 16 standard transformer attention. Only 25% of the model uses a KV cache. The rest uses something else entirely.
We run ionattention, a C++ inference engine built from scratch for GH200. Every model we support gets a custom adapter — weight mapping, architecture extraction, kernel specs — that feeds into the same C++ executor. We had Qwen3.5 running in production on March 2nd. This post covers what we found inside the weights and what it took to make them fast.
It's Not Mamba
The weights look Mamba-like: A_log, conv1d, dt_bias, in_proj. The HuggingFace config calls it "linear_attention." Reasonable to assume it's Mamba.
It's not. It's a gated delta rule — a different recurrence with different semantics. Mamba's state is a vector per head. The delta rule's state is a matrix per head — specifically [128, 128] for each of 48 heads, per layer, per request. The state acts as a learnable associative memory: written with a delta rule, read with a query.
The recurrence per timestep, per head:
```
q, k = L2_norm(Q[t]), L2_norm(K[t])
q = q / sqrt(Dk)
g = -exp(A_log) * softplus(a[t] + dt_bias)   # decay, always negative
beta = sigmoid(b[t])                         # per-token learning rate
S = exp(g) * S                               # decay state first
error = v - S^T @ k                          # prediction error
S = S + k ⊗ (error * beta)                   # delta rule update
o = S^T @ q                                  # read from memory
```

The order matters. An early version retrieved before decaying — `S^T @ k` then `exp(g) * S`. Output looked plausible. Passed a casual glance test. Failed numerical comparison against the reference implementation by enough to degrade generation quality over long sequences. Isolating this required tracing the matrix state element-by-element against HuggingFace.
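To make the recurrence concrete, here is a minimal NumPy sketch of one timestep for a single head. This is illustrative, not the engine's kernel; the epsilon in the L2 norm is an assumption.

```python
import numpy as np

def delta_rule_step(S, q, k, v, a, b, A_log, dt_bias):
    """One timestep of the gated delta rule for a single head.

    S: [Dk, Dv] matrix state. q, k: [Dk]. v: [Dv]. a, b: scalars.
    """
    def l2norm(x):
        return x / np.sqrt(np.sum(x * x) + 1e-6)

    def softplus(x):
        return np.log1p(np.exp(x))

    q = l2norm(q) / np.sqrt(len(q))             # L2-normalize, then scale query
    k = l2norm(k)
    g = -np.exp(A_log) * softplus(a + dt_bias)  # decay, always negative
    beta = 1.0 / (1.0 + np.exp(-b))             # per-token learning rate

    S = np.exp(g) * S                           # 1. decay state FIRST
    error = v - S.T @ k                         # 2. prediction error
    S = S + np.outer(k, error * beta)           # 3. delta-rule write
    o = S.T @ q                                 # 4. read with the query
    return S, o
```

With a zero initial state the decay is a no-op and the write is exactly `outer(k_normed, v * beta)`, which makes the "learnable associative memory" reading easy to verify by hand.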
The a and b projections produce 48 scalars each per token — one per V-head. They're tiny GEMMs ([5120, 48]) but they dynamically control how fast each head forgets and how aggressively it writes. This is a sharp departure from classic Mamba, where decay trajectories are governed by fixed or non-dynamic parameters.
The memory cost of the matrix-valued state is significant. For 27B: 48 heads × 128 × 128 × 2 bytes = 1.5 MB per layer per request. Across 48 Mamba layers, that's 72 MB per active sequence. At max batch size 64, the engine permanently allocates ~4.6 GB of SSM state.
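The arithmetic behind those figures, as a quick check (note the ~4.6 GB total is ~4.5 GiB in binary units):

```python
heads, dk, dv, bytes_bf16 = 48, 128, 128, 2
mamba_layers, max_batch = 48, 64

per_layer = heads * dk * dv * bytes_bf16    # 1.5 MiB per layer per request
per_request = per_layer * mamba_layers      # 72 MiB per active sequence
pool = per_request * max_batch              # 4.5 GiB permanently allocated
```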
The Weight Namespace Trap
Standard causal LMs organize parameters under model.layers.{N}.*. Qwen3.5 wraps the language model inside a multimodal vision-language framework, prefixing everything with model.language_model.layers.{N}.*.
Without intercepting this, a standard inference engine matches none of the checkpoint's parameters: the load silently completes, leaving uninitialized tensors that produce confident but randomized output. Our adapter uses explicit regex mappings to strip the multimodal namespace and remap to the C++ executor's layer arrays.
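A minimal sketch of that kind of remapping, with an illustrative rule table (not the adapter's actual mappings):

```python
import re

# Illustrative rule: strip the multimodal wrapper prefix so
# model.language_model.layers.{N}.* maps back to model.layers.{N}.*.
REMAP = [
    (re.compile(r"^model\.language_model\.(.*)$"), r"model.\1"),
]

def remap_name(name: str) -> str:
    """Return the engine-side name for a checkpoint parameter name."""
    for pat, repl in REMAP:
        name, n = pat.subn(repl, name)
        if n:
            break  # first matching rule wins
    return name
```

Names outside the wrapped namespace (e.g. `lm_head.weight`) pass through untouched, which is what lets one adapter serve both wrapped and unwrapped checkpoints.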
Two Models in One
Qwen3.5 has two completely independent head configurations coexisting in the same parameter space. The executor cannot assume uniform dimensionality across layers.
| | Mamba Layers (48 of 64) | Attention Layers (16 of 64) |
|---|---|---|
| Head dimension | 128 | 256 |
| Q/K heads | 16 | 24 |
| V heads | 48 | 4 |
| GQA ratio | 3:1 | 6:1 |
| Memory mechanism | Recurrent matrix state [Dk × Dv] | Standard KV cache |
| RoPE | None | Partial — 64 of 256 dims |
The KV cache covers 25% of layers. SSM state covers the other 75%. This structurally eliminates 75% of the KV memory a pure transformer of equivalent depth would require.
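The 75% figure follows directly from the layer split; a quick sketch, assuming the attention-layer KV config from the table above:

```python
layers_total, attn_layers = 64, 16
kv_heads, head_dim, bytes_bf16 = 4, 256, 2

# K + V per token per attention layer
per_token_layer = 2 * kv_heads * head_dim * bytes_bf16   # 4 KiB

hybrid_kv_per_token = attn_layers * per_token_layer      # 64 KiB
pure_kv_per_token = layers_total * per_token_layer       # 256 KiB
savings = 1.0 - hybrid_kv_per_token / pure_kv_per_token  # 0.75
```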
The Mamba Forward Pass
Each Mamba layer runs a six-step pipeline, all in C++, all on the compute stream:
Step 1: Four input projection GEMMs. Rather than a single fused QKV, the delta rule requires four separate pathways:
```
normed [T, 5120] → in_proj_qkv → [T, 10240]   (QKV for recurrence)
normed [T, 5120] → in_proj_z   → [T, 6144]    (gate)
normed [T, 5120] → in_proj_a   → [T, 48]      (per-head decay scalar)
normed [T, 5120] → in_proj_b   → [T, 48]      (per-head learning rate)
```

cuBLAS-LT GEMMs, FP8 where available. The `a` and `b` projections are kept in full precision — they carry the control signals for state dynamics.
Step 2: Causal conv1d (kernel_size=4). Depthwise convolution over the QKV output, channels-last [T, 10240]. Provides each token a local receptive field of the preceding 4 tokens before the recurrence.
Prefill writes to a dedicated second buffer — the causal window means output at token t overwrites input needed for token t+1, so in-place execution creates a data race. Decode uses a ring buffer with slot-table indirection for CUDA graph compatibility, safe in-place since there's only one token. After prefill, the last 3 pre-convolved tokens are saved to the ring buffer so the first decode step picks up the right history.
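The in-place data race is easy to demonstrate with a toy causal depthwise conv. This is a sketch of the hazard, not the engine's kernel:

```python
import numpy as np

def causal_conv1d(x, w, out=None):
    """Depthwise causal conv. x: [T, C] channels-last, w: [C, K].

    If out is None the conv runs in place: token t's output overwrites
    history that token t+1 still needs, which is exactly the race the
    dedicated prefill buffer avoids.
    """
    T, C = x.shape
    K = w.shape[1]
    y = x if out is None else out
    for t in range(T):
        acc = np.zeros(C)
        for j in range(K):
            tt = t - (K - 1) + j          # causal window: t-K+1 .. t
            if tt >= 0:
                acc += x[tt] * w[:, j]
        y[t] = acc                        # in-place: clobbers x[t]
    return y
```

Token 0 is unaffected (it only reads itself before writing), but every later token reads already-overwritten history, so in-place and double-buffered results diverge from token 1 onward.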
Step 3: Delta rule recurrence. The core of the Mamba layer. Prefill processes the full context, decode dispatches across batch × V-heads × value tiles. Each program loads its state tile from the SSM pool, executes the recurrence, stores back.
Step 4: Per-head RMSNorm on the delta output, treating [T × 48, 128] as rows.
Step 5: Gate multiply. SiLU(z) × normed_delta_out.
Step 6: Output projection. [T, 6144] → [T, 5120], back to hidden size.
After step 6, the Mamba path merges back into the standard residual-add + post-LN path. Both layer types converge at the same fused kernel. MLP runs identically for both.
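Steps 4-6 can be sketched in a few lines of NumPy. This is illustrative shape bookkeeping, assuming the Mamba-block norm uses the standard weight formula (`w * x_norm`) and an assumed epsilon:

```python
import numpy as np

def mamba_tail(delta_out, z, norm_w, W_out):
    """Steps 4-6: per-head RMSNorm, SiLU(z) gate, output projection.

    delta_out, z: [T, H*Dv]; norm_w: [Dv]; W_out: [H*Dv, hidden].
    """
    T, D = delta_out.shape
    Dv = norm_w.shape[0]
    rows = delta_out.reshape(T * (D // Dv), Dv)        # treat [T*H, Dv] as rows
    rms = np.sqrt(np.mean(rows * rows, axis=-1, keepdims=True) + 1e-6)
    normed = (rows / rms * norm_w).reshape(T, D)       # standard formula
    gated = (z / (1.0 + np.exp(-z))) * normed          # SiLU(z) * normed_delta_out
    return gated @ W_out                               # back to hidden size
```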
SSM State Management
Both the delta rule state and the conv ring buffer live in pre-allocated pools:
```
ssm_state_       [64, 48, 48, 128, 128]   BF16   ≈ 4.6 GB
ssm_conv_state_  [64, 48, 10240, 4]       BF16   ≈ 0.25 GB
```

Allocated once at startup. Zero malloc during inference. Requests get a slot index on admission, return it on completion.
For CUDA graphs, the slot table (ssm_slot_table_d_) maps batch position → slot on device. The graph captures the layer-base pointer at capture time; per-request indirection happens inside the kernel at replay time via state_ptr = layer_base + slot * stride. Same captured graph, different slots every step.
This is the same technique we described in our ionattention post — GH200's cache-coherent NVLink-C2C lets the CPU update the slot table in managed memory between graph replays, and the GPU reads the new mapping at 900 GB/s with no copy, no patch, no re-capture.
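A toy model of the indirection, with illustrative names and dimensions: the "captured" step only ever sees the pool base and the slot table, so remapping a batch position to a new slot between replays changes nothing about the captured work.

```python
import numpy as np

LAYERS, SLOTS, STATE = 2, 4, 6
pool = np.zeros((LAYERS, SLOTS, STATE))      # pre-allocated state pool
slot_table = np.zeros(3, dtype=np.int64)     # batch position -> slot

def captured_step(layer, batch_pos, value):
    """Stand-in for a graph-captured kernel: the pool base pointer is
    fixed at 'capture'; per-request indirection happens at 'replay' via
    state = pool[layer, slot_table[batch_pos]]."""
    slot = slot_table[batch_pos]
    pool[layer, slot] += value
```

In the real engine the CPU rewrites `ssm_slot_table_d_` in coherent managed memory between graph replays; here that is just an array assignment before calling `captured_step` again.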
The Attention Layers Are Weird Too
Gated Q. The query projection outputs 2 × q_dim. The first half is Q. The second half is a sigmoid gate applied after attention, before the O projection:
```
dispatcher_->launch_sigmoid_mul(gate, attn_output, attn_output, elements, stream);
```

A custom elementwise CUDA kernel: `output[i] = sigmoid(gate[i]) * input[i]`.
Partial RoPE. head_dim=256 but only 64 dims rotate. A dedicated kernel applies standard RoPE to dims [0..63] and leaves [64..255] untouched. The frequency exponent must use rotary_dim=64 as its base — not head_dim=256. Using the wrong base compresses the frequency bands, destroying the model's capacity for position-dependent reasoning.
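A sketch of partial RoPE with the correct frequency base. The half-split pairing convention below is an assumption (checkpoints differ on pairing); the point being illustrated is the `rotary_dim=64` exponent base and the untouched tail:

```python
import numpy as np

def partial_rope(x, pos, rotary_dim=64, base=10000.0):
    """Rotate only dims [0..rotary_dim) of one head vector x.

    Key detail: the frequency exponent is normalized by rotary_dim,
    NOT by the full head_dim (256). Dims [rotary_dim..] pass through.
    """
    half = rotary_dim // 2
    inv_freq = base ** (-np.arange(half) * 2.0 / rotary_dim)  # base = rotary_dim
    angles = pos * inv_freq
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:half], x[half:rotary_dim]
    out = x.copy()
    out[:half] = x1 * cos - x2 * sin
    out[half:rotary_dim] = x1 * sin + x2 * cos
    return out
```

Using `head_dim` in the exponent instead would squeeze all 32 frequency bands toward the slow end, which is the band-compression failure described above.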
Four Norms, Three Formulas
Qwen3.5's RMSNorm initializes weights to zero and applies (1 + weight) × norm(x). Standard RMSNorm initializes to one and applies weight × norm(x). At initialization both evaluate to 1.0. Post-training they diverge completely. Applying standard RMSNorm to zero-initialized weights collapses activations to zero.
We added RESIDUAL_WEIGHT as a compile-time constant in the normalization kernel:
```
if RESIDUAL_WEIGHT:
    w = 1.0 + w
result = x_norm * w
```

But the model has four normalization contexts, and they don't all use the same formula:
| Kernel | Context | RESIDUAL_WEIGHT |
|---|---|---|
| layernorm | Pre/post layer norms | True |
| fusedlnresidual | Fused residual + norm | True |
| qk_norm | Per-head Q/K norm (attention) | True |
| mamba_norm | Per-head norm (Mamba block) | False |
The Mamba internal norm uses a different class (Qwen3_5RMSNormGated) that reverts to the standard formula. Getting this wrong doesn't crash or produce NaN — it quietly degrades generation quality in ways you won't catch without logit-level numerical comparison.
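The two formulas and the zero-weight collapse are easy to reproduce. A minimal sketch, with an assumed epsilon:

```python
import numpy as np

def rmsnorm(x, w, residual_weight, eps=1e-6):
    """RMSNorm in both conventions.

    residual_weight=True:  (1 + w) * norm(x)  — Qwen3.5's main norms,
                           weights stored as offsets from 1.
    residual_weight=False: w * norm(x)        — standard, used by the
                           Mamba block's gated norm.
    """
    x_norm = x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x_norm * ((1.0 + w) if residual_weight else w)
```

With zero-initialized weights, the residual-weight form reduces to the plain norm, while the standard form zeroes every activation, which is the collapse described above.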
FP8: What You Can and Can't Quantize
The large matmuls (in_proj_qkv, in_proj_z, out_proj) take FP8 well — they dominate compute and benefit from Hopper's tensor cores. The Mamba control parameters don't survive quantization:
- A_log, dt_bias — state decay parameters. FP8 truncates the temporal resolution, causing the memory to forget too rapidly or fail to overwrite stale tokens.
- in_proj_a, in_proj_b — 48-dimensional per-head scalars. Zero latency gain, meaningful precision loss.
- conv1d — [d_inner, 4]. Tiny footprint, not worth quantizing.
These are permanently excluded from the FP8 pipeline.
The 35B MoE Cascade
Scaling from the dense 27B to the 35B-A3B MoE introduced a stack of overlapping issues.
The FP8 checkpoint is broken. The official Qwen/Qwen3.5-35B-A3B-FP8 on HuggingFace produces garbage output — not just in our engine, but in HuggingFace transformers and MLX as well. The calibration scales appear corrupted. The only viable path: download the BF16 weights and self-quantize with our own pipeline, respecting the Mamba precision exclusions.
Fused expert weights. The BF16 checkpoint packs all 256 experts into contiguous blocks — experts.gate_up_proj [256, 1024, 2048]. Our MoE kernels expect per-expert tensors. A post_load_transforms hook splits these on load: 256 experts × (gate + up) = 512 tensor registrations per layer.
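A hedged sketch of such a post-load split (dims shrunk for illustration; the real tensor is [256, 1024, 2048], and whether gate and up are concatenated or interleaved in the fused dim is checkpoint-specific; concatenation is assumed here):

```python
import numpy as np

num_experts = 256
hidden, inter = 8, 4   # illustrative; real checkpoint: 1024, fused last dim 2048
fused = np.zeros((num_experts, hidden, 2 * inter), dtype=np.float16)

per_expert = {}
for e in range(num_experts):
    gate, up = np.split(fused[e], 2, axis=-1)   # [hidden, inter] each
    per_expert[f"experts.{e}.gate_proj"] = gate
    per_expert[f"experts.{e}.up_proj"] = up
```

256 experts times (gate + up) yields the 512 per-layer tensor registrations mentioned above.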
Shared expert dimension mismatch. The shared expert GEMM reads config_.intermediate_size for its matrix dimensions. The adapter inherited 17408 from the dense parent class. The shared expert's actual intermediate is 512. With the wrong value, the GEMM reads 34× past the weight boundary — immediate CUDA_ILLEGAL_ADDRESS. One line fix, several hours to find.
Mixed Batch State Corruption
With chunked prefill, a batch contains both decoding requests (1 token each) and prefilling requests (many tokens). The initial implementation aggregated all tokens into a single sequence and used the first request's SSM slot for state writes.
If the first request was decoding, its associative memory got overwritten with prefill tokens from a different sequence. The prefill request's state wrote to the wrong slot. Both produced wrong output — but only under concurrent load, when batches are actually mixed.
Fix: fork the execution at every Mamba operation. Decode requests route through the slot-table-indirected decode path. Prefill requests get isolated execution with their own token ranges and dedicated slots.
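The routing can be sketched as follows. Field and function names are illustrative, not the engine's API; the invariant being shown is that no token ever touches another request's slot:

```python
def run_mamba_op(batch, decode_kernel, prefill_kernel):
    """Fork each Mamba op: decode requests batch through slot-table
    indirection; prefill requests run isolated with their own ranges."""
    decode = [r for r in batch if r.num_tokens == 1]
    prefill = [r for r in batch if r.num_tokens > 1]
    if decode:
        # one batched launch; each lane resolves its own SSM slot
        decode_kernel([r.slot for r in decode])
    for r in prefill:
        # isolated execution: own token range, own dedicated slot
        prefill_kernel(r.slot, r.token_range)
```

The broken version instead concatenated all tokens and wrote state through the first request's slot, which is why the corruption only surfaced under concurrent load.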
Buffer Overflow at T≈280
During stress testing of the 35B model, prefill contexts around 280 tokens triggered CUDA_ILLEGAL_ADDRESS inside fused_ln_residual. The workspace pool sizes scratch buffers based on intermediate_size — for the MoE model, that's 512 (shared expert), giving ~4.2 MB.
Mamba needs [T, qkv_dim] at qkv_dim=10240. At T=280: 4.6 MB. Silent overflow into adjacent workspace memory.
Fix: Mamba operations are permanently firewalled from the MLP workspace pool. Dedicated buffers, sized for max_tokens × qkv_dim. Pool buffers are for MLP intermediates only.
What This Means
Qwen3.5 is the first major hybrid Mamba-Transformer model to ship at scale. The architecture has real advantages: 75% KV memory reduction, O(1) per-token decode cost for the majority of layers, and competitive generation quality.
But "Mamba" isn't Mamba anymore. The weights carry familiar names, the config says "linear_attention," but the actual recurrence is a matrix-valued delta rule with L2-normalized queries and keys, dynamic decay, and a learned write gate. If you're building inference infrastructure, you can't pattern-match on weight names. You have to read the forward pass.
We have both the dense 27B and MoE 35B-A3B in production on GH200 — CUDA graphs, mixed-model serving, sub-second switching. The hybrid architecture fits naturally into ionattention's existing design: per-layer graphs handle Mamba/attention dispatch transparently, SSM state pools reuse the same slot-table indirection as KV cache blocks, and GH200's coherent memory makes all of it graph-compatible at zero overhead.
If you're running open-source models and want inference built around your hardware, come talk to us.