vllm.model_executor.layers.quantization.turboquant.config ¶
TurboQuant configuration.
TurboQuantConfig dataclass ¶
Configuration for TurboQuant KV-cache quantization.
Applies Hadamard rotation followed by per-coordinate Lloyd-Max scalar quantization for keys, and uniform quantization for values.
Historical note: this is the scalar case of the HIGGS quantization method (Malinovskii et al., "Pushing the Limits of Large Language Model Quantization via the Linearity Theorem", NAACL 2025; preprint arXiv:2411.17525): rotation + optimized grid + optional re-normalization, applied to KV cache compression. A first application of this approach to KV-cache compression is in "Cache Me If You Must: Adaptive Key-Value Quantization for Large Language Models" (Shutova et al., ICML 2025; preprint arXiv:2501.19392). Both these references pre-date the TurboQuant paper.
QJL is intentionally omitted: community consensus (5+ independent groups) found that it hurts attention quality by amplifying variance through the softmax.
Named presets (use via --kv-cache-dtype):

- turboquant_k8v4: FP8 keys + 4-bit values (2.6x compression, +1.17% PPL)
- turboquant_4bit_nc: 4-bit MSE keys + 4-bit values + NC (3.8x, +2.71%)
- turboquant_k3v4_nc: 3-bit MSE keys + 4-bit values + NC (~3.5x, +10.63%)
- turboquant_3bit_nc: 3-bit MSE keys + 3-bit values + NC (4.9x, +20.59%)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| head_dim | int | Attention head dimension (e.g. 64, 96, 128). | 128 |
| key_quant_bits | int | Bits for key quantization. 8 = FP8 keys (no rotation/MSE); 3 or 4 = Lloyd-Max MSE-quantized keys. | 3 |
| value_quant_bits | int | Bits per value dimension for uniform quantization. 3 = 8 levels; 4 = 16 levels (default). | 4 |
| norm_correction | bool | Re-normalize centroid vectors to unit norm before the inverse rotation during dequantization. Fixes quantization-induced norm distortion, improving PPL by ~0.8% at 4-bit. | False |
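The parameters above can be sketched as a standalone dataclass. This is a minimal illustration with the table's field names and defaults, not the actual vLLM class (the validation in `__post_init__` is an assumption):

```python
from dataclasses import dataclass


@dataclass
class TurboQuantConfig:
    """Minimal sketch of the TurboQuant KV-cache config (illustrative only)."""

    head_dim: int = 128            # attention head dimension (e.g. 64, 96, 128)
    key_quant_bits: int = 3        # 8 = FP8 keys; 3 or 4 = Lloyd-Max MSE keys
    value_quant_bits: int = 4      # bits per value dimension (uniform quantization)
    norm_correction: bool = False  # re-normalize centroids before inverse rotation

    def __post_init__(self) -> None:
        # Hypothetical validation of the documented value ranges.
        if self.key_quant_bits not in (3, 4, 8):
            raise ValueError("key_quant_bits must be 3, 4, or 8")
        if self.value_quant_bits not in (3, 4):
            raise ValueError("value_quant_bits must be 3 or 4")
```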
Source code in vllm/model_executor/layers/quantization/turboquant/config.py
effective_value_quant_bits property ¶
effective_value_quant_bits: int
Actual bits used for value storage.
key_mse_bits property ¶
key_mse_bits: int
MSE bits actually used for key quantization (0 if FP8 keys).
key_packed_size property ¶
key_packed_size: int
Packed bytes for a single KEY vector.
FP8 mode (key_quant_bits=8): head_dim bytes (1 byte per element, no overhead).
TQ mode (key_quant_bits < 8):
- MSE indices: ceil(head_dim * key_mse_bits / 8) bytes
- vec_norm: 2 bytes (float16)
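The arithmetic above can be sketched as a standalone function (the name and signature are illustrative, not the vLLM property):

```python
import math


def key_packed_size(head_dim: int, key_quant_bits: int) -> int:
    """Packed bytes per KEY vector, per the layout described above (sketch)."""
    if key_quant_bits == 8:
        # FP8 mode: one byte per element, no metadata overhead.
        return head_dim
    # TQ mode: bit-packed MSE indices plus a 2-byte float16 vector norm.
    return math.ceil(head_dim * key_quant_bits / 8) + 2
```

For head_dim=128 this gives 128 bytes in FP8 mode, 66 bytes at 4 bits (64 index bytes + 2-byte norm), and 50 bytes at 3 bits.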
mse_bits property ¶
mse_bits: int
MSE quantizer bit-width (determines centroid count: 2^mse_bits).
For MSE key modes, equals key_quant_bits. For FP8 key mode, falls back to value_quant_bits (centroids are still needed for continuation-prefill dequant and decode kernel params).
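The fallback rule can be sketched as a small helper (illustrative, not the actual property):

```python
def mse_bits(key_quant_bits: int, value_quant_bits: int) -> int:
    """Bit-width of the Lloyd-Max grid: 2**mse_bits centroids (sketch)."""
    # FP8 keys carry no MSE indices, but the decode kernel and
    # continuation-prefill dequant still need a centroid grid, so
    # fall back to the value bit-width.
    if key_quant_bits == 8:
        return value_quant_bits
    return key_quant_bits
```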
slot_size property ¶
slot_size: int
Total packed bytes per head per position (key + value combined).
Layout: [key_packed | value_packed]
slot_size_aligned property ¶
slot_size_aligned: int
Slot size rounded up to the next even number.
An even number is required so that effective_head_size = slot_size_aligned // 2 is an integer.
value_packed_size property ¶
value_packed_size: int
Packed bytes for a single VALUE vector.
Uniform quantization: ceil(head_dim * bits / 8) + 4 bytes (fp16 scale and fp16 zero-point).
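The value-side arithmetic and the even-alignment rule above can be sketched together (function names are illustrative; the slot helper takes pre-computed key/value byte counts):

```python
import math


def value_packed_size(head_dim: int, value_quant_bits: int) -> int:
    """Packed bytes per VALUE vector: bit-packed codes + fp16 scale/zero (sketch)."""
    return math.ceil(head_dim * value_quant_bits / 8) + 4


def slot_size_aligned(key_bytes: int, value_bytes: int) -> int:
    """[key_packed | value_packed] slot size, rounded up to an even number."""
    slot = key_bytes + value_bytes
    # Even so that effective_head_size = slot_size_aligned // 2 is integral.
    return slot + (slot % 2)
```

For head_dim=128, 4-bit values pack into 68 bytes (64 code bytes + 4 bytes of scale/zero) and 3-bit values into 52 bytes.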
from_cache_dtype staticmethod ¶
from_cache_dtype(
cache_dtype: str, head_dim: int
) -> TurboQuantConfig
Create config from a named preset.
Valid presets: turboquant_k8v4, turboquant_4bit_nc, etc.
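A sketch of how the named presets could map onto config fields. The preset table here is reconstructed from the preset descriptions in the overview, not copied from the vLLM source, and `parse_preset` is a hypothetical helper:

```python
# Reconstructed from the preset descriptions above (illustrative).
_PRESETS: dict[str, tuple[int, int, bool]] = {
    #                     (key_bits, value_bits, norm_correction)
    "turboquant_k8v4":    (8, 4, False),
    "turboquant_4bit_nc": (4, 4, True),
    "turboquant_k3v4_nc": (3, 4, True),
    "turboquant_3bit_nc": (3, 3, True),
}


def parse_preset(cache_dtype: str) -> tuple[int, int, bool]:
    """Map a --kv-cache-dtype preset name to (key_bits, value_bits, nc)."""
    try:
        return _PRESETS[cache_dtype]
    except KeyError:
        raise ValueError(f"unknown TurboQuant preset: {cache_dtype!r}") from None
```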
Source code in vllm/model_executor/layers/quantization/turboquant/config.py
get_boundary_skip_layers staticmethod ¶
get_boundary_skip_layers(
model_config: ModelConfig, n: int = 2
) -> list[str]
Layer indices to skip TQ compression (boundary protection).
For hybrid models (attention + Mamba/linear-attention), boundary protection is disabled: hybrids typically have only 8-12 full-attention layers, so a hard n=2 on each side would cover ~40% of them. The dense GSM8K baselines that motivate n=2 do not apply to hybrids.
For dense models, skips the first N and last N attention layers. This is empirically required for aggressive presets (k3v4_nc, 3bit_nc); without it, GSM8K drops ~30 points on Qwen3-4B.
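The dense-model case can be sketched as index arithmetic. This returns layer indices rather than vLLM layer names (which depend on model internals), and the degenerate-case handling is an assumption:

```python
def boundary_skip_indices(num_attn_layers: int, n: int = 2) -> list[int]:
    """Indices of the first n and last n attention layers (sketch).

    These layers keep a full-precision KV cache; the hybrid-model case is
    handled upstream (boundary protection disabled), so this only covers
    dense models.
    """
    if num_attn_layers <= 2 * n:
        # Assumed degenerate case: every layer would be a boundary layer.
        return list(range(num_attn_layers))
    return list(range(n)) + list(range(num_attn_layers - n, num_attn_layers))
```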
Source code in vllm/model_executor/layers/quantization/turboquant/config.py
_get_full_attention_layer_indices ¶
_get_full_attention_layer_indices(
model_config: ModelConfig,
) -> list[int]
Global indices of full-attention layers in a hybrid model.
Covers the conventions used across vLLM: layer_types (Qwen3.5/Next), layers_block_type (Jamba/Zamba2), attn_type_list (Minimax).
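The attribute probing could look like this. The attribute names come from the conventions listed above, but the per-model marker values ("full_attention", "attention", 1) are assumptions about each family's config format:

```python
from types import SimpleNamespace


def full_attention_layer_indices(hf_config) -> list[int]:
    """Global indices of full-attention layers in a hybrid config (sketch)."""
    # (attribute name, assumed marker value for a full-attention layer)
    conventions = [
        ("layer_types", "full_attention"),   # Qwen3.5/Next style
        ("layers_block_type", "attention"),  # Jamba/Zamba2 style
        ("attn_type_list", 1),               # Minimax style
    ]
    for attr, marker in conventions:
        layer_types = getattr(hf_config, attr, None)
        if layer_types is not None:
            return [i for i, t in enumerate(layer_types) if t == marker]
    return []


# Usage with a stand-in config object:
cfg = SimpleNamespace(layers_block_type=["attention", "mamba", "attention"])
```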