vllm.model_executor.layers.quantization.utils.nvfp4_emulation_utils ¶
_dequantize_nvfp4_kernel ¶
_dequantize_nvfp4_kernel(
fp4_ptr,
scale_ptr,
global_scale_ptr,
output_ptr,
rows_per_batch: constexpr,
num_blocks: constexpr,
BLOCK_SIZE: constexpr,
has_batch_global_scale: constexpr,
TILE_BLOCKS: constexpr,
)
Triton kernel for NVFP4 dequantization (swizzle=False).
Optimized with 2D tile processing and interleaving for coalesced stores.
Source code in vllm/model_executor/layers/quantization/utils/nvfp4_emulation_utils.py
_e2m1_inline ¶
Inline E2M1 lookup using a binary decision tree: 3 levels of comparisons instead of 7 sequential ones.
Maps a 3-bit magnitude to a float: [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0].
Uses bit decomposition for fewer comparisons.
Source code in vllm/model_executor/layers/quantization/utils/nvfp4_emulation_utils.py
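The magnitude table above follows from the E2M1 layout (2 exponent bits, 1 mantissa bit). The following is a minimal plain-Python sketch of that decomposition, not the Triton code itself. The helper names `e2m1_magnitude_to_float` and `decode_fp4_code` are hypothetical, and placing the sign in bit 3 of the 4-bit code is assumed from the standard E2M1 format.

```python
# Magnitude table quoted in the docstring above.
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def e2m1_magnitude_to_float(mag: int) -> float:
    """Decode a 3-bit E2M1 magnitude via its exponent and mantissa bits."""
    exponent = (mag >> 1) & 0b11  # upper two bits
    mantissa = mag & 0b1          # lowest bit
    if exponent == 0:
        # Subnormal range: 0.0 or 0.5.
        return 0.5 * mantissa
    # Normal range: (1 + mantissa / 2) * 2^(exponent - 1).
    return (1.0 + 0.5 * mantissa) * 2.0 ** (exponent - 1)


def decode_fp4_code(code: int) -> float:
    """Decode a full 4-bit code; bit 3 is assumed to be the sign bit."""
    sign = -1.0 if code & 0b1000 else 1.0
    return sign * E2M1_VALUES[code & 0b111]
```

The bit-level decode reproduces the table for all eight magnitudes, which is why a lookup can be replaced by a few bit tests.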
_e2m1_lookup ¶
Lookup E2M1 float value from 3-bit magnitude.
Source code in vllm/model_executor/layers/quantization/utils/nvfp4_emulation_utils.py
_nvfp4_quant_dequant_kernel ¶
_nvfp4_quant_dequant_kernel(
input_ptr,
output_ptr,
global_scale_ptr,
k: constexpr,
num_blocks: constexpr,
BLOCK_SIZE: constexpr,
FP4_MAX_RECIPROCAL: constexpr,
TILE_BLOCKS: constexpr,
)
Fused NVFP4 quantize-dequantize kernel.
Uses a 2D grid (rows x tiles) to parallelize across both rows and quantization groups within a row. Each program handles TILE_BLOCKS groups at once using vectorized 2D operations.
Source code in vllm/model_executor/layers/quantization/utils/nvfp4_emulation_utils.py
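As a rough illustration of what a fused quantize-dequantize does to one row, here is a plain-Python sketch that processes groups of 16 values sequentially. It rests on several assumptions that the docstring does not confirm: the per-group scale is derived from the group's absolute maximum divided by the FP4 maximum of 6.0 (which is presumably what `FP4_MAX_RECIPROCAL` encodes), the rounding of that scale to FP8 is skipped, and ties in rounding go to the smaller magnitude. The function name `quant_dequant_row` is hypothetical.

```python
import bisect
import math

FP4_MAX = 6.0
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
# Midpoints between neighbouring E2M1 magnitudes.
_MIDPOINTS = [0.25, 0.75, 1.25, 1.75, 2.5, 3.5, 5.0]


def _round_magnitude(a: float) -> float:
    """Nearest E2M1 magnitude; exact ties go to the smaller value (simplification)."""
    return E2M1_VALUES[bisect.bisect_left(_MIDPOINTS, a)]


def quant_dequant_row(row, global_scale=1.0, block_size=16):
    """Quantize each group to E2M1 with a per-group scale, then dequantize."""
    assert len(row) % block_size == 0
    out = []
    for start in range(0, len(row), block_size):
        block = row[start:start + block_size]
        amax = max(abs(v) for v in block)
        # Stored scale (assumed form); a real kernel would round this to FP8.
        scale = global_scale * amax / FP4_MAX
        # Effective multiplier applied when dequantizing.
        decode_scale = scale / global_scale
        for v in block:
            if decode_scale == 0.0:
                out.append(0.0)  # all-zero group
                continue
            scaled = max(-FP4_MAX, min(FP4_MAX, v / decode_scale))
            q = math.copysign(_round_magnitude(abs(scaled)), scaled)
            out.append(q * decode_scale)
    return out
```

In the kernel, the loop over groups is replaced by the 2D grid: each program would cover `TILE_BLOCKS` such groups of one row at once.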
_round_to_fp4 ¶
Round float values to the nearest E2M1 representable value.
Matches the thresholds in the Python cast_to_fp4 exactly.
Source code in vllm/model_executor/layers/quantization/utils/nvfp4_emulation_utils.py
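Rounding to the nearest E2M1 value amounts to comparing the magnitude against the midpoints between neighbouring representable values. The sketch below shows one way to write that in plain Python. The direction in which exact ties resolve (inclusive versus exclusive bounds) is an assumption here and should be checked against `cast_to_fp4`; the name `round_to_e2m1` is hypothetical.

```python
def round_to_e2m1(x: float) -> float:
    """Round x to the nearest E2M1 value, saturating at +/-6.0.

    The tie handling (which side of each midpoint is inclusive) is assumed,
    not taken from the documented source.
    """
    sign = -1.0 if x < 0 else 1.0
    a = abs(x)
    if a <= 0.25:
        v = 0.0
    elif a < 0.75:
        v = 0.5
    elif a <= 1.25:
        v = 1.0
    elif a < 1.75:
        v = 1.5
    elif a <= 2.5:
        v = 2.0
    elif a < 3.5:
        v = 3.0
    elif a <= 5.0:
        v = 4.0
    else:
        v = 6.0
    return sign * v
```

For example, `round_to_e2m1(0.3)` gives 0.5 and `round_to_e2m1(-2.6)` gives -3.0, while anything above 5.0 in magnitude saturates to 6.0.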
_triton_dequantize_nvfp4 ¶
_triton_dequantize_nvfp4(
tensor_fp4: Tensor,
tensor_sf: Tensor,
global_scale: Tensor,
dtype: dtype,
block_size: int = 16,
) -> Tensor
Dequantize NVFP4 using Triton (swizzle=False only).
Supports both 2D and 3D inputs:
- 2D: [m, packed_k] -> [m, k]
- 3D: [dim0, m, packed_k] -> [dim0, m, k]
Source code in vllm/model_executor/layers/quantization/utils/nvfp4_emulation_utils.py
_triton_nvfp4_quant_dequant ¶
Triton-accelerated NVFP4 quantize-dequantize.
Source code in vllm/model_executor/layers/quantization/utils/nvfp4_emulation_utils.py
dequantize_to_dtype ¶
dequantize_to_dtype(
tensor_fp4: Tensor,
tensor_sf: Tensor,
global_scale: Tensor,
dtype: dtype,
block_size: int = 16,
swizzle: bool | None = True,
)
Dequantize the fp4 tensor back to high precision.
Supports both 2D and 3D inputs:
- 2D: [m, packed_k] -> [m, k]
- 3D: [dim0, m, packed_k] -> [dim0, m, k]
Source code in vllm/model_executor/layers/quantization/utils/nvfp4_emulation_utils.py
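To make the shape change from packed_k to k concrete, the following plain-Python sketch dequantizes a single unswizzled row. It assumes that each byte packs two 4-bit codes with the low nibble first (so k = 2 * packed_k), that there is one scale factor per `block_size` output values, and that the output is the decoded value times the block scale divided by the global scale. None of these details are stated in the docstring, and `dequantize_row` is a hypothetical name.

```python
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def dequantize_row(packed, block_scales, global_scale, block_size=16):
    """Unpack one row of packed FP4 bytes and apply block and global scales."""
    values = []
    for byte in packed:
        # Assumed nibble order: low nibble is the first element.
        for code in (byte & 0x0F, byte >> 4):
            sign = -1.0 if code & 0x8 else 1.0
            values.append(sign * E2M1_VALUES[code & 0x7])
    # One scale per block of `block_size` unpacked values.
    assert len(values) == len(block_scales) * block_size
    return [
        v * block_scales[i // block_size] / global_scale
        for i, v in enumerate(values)
    ]
```

With 8 packed bytes and one block scale, the result has 16 values, matching the [m, packed_k] -> [m, k] mapping for a single row.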
ref_nvfp4_quant_dequant ¶
NVFP4 quantize-dequantize operation.
global_scale is expected to have a single element.