solvers.CompressionModel

solvers.CompressionModel()

Analyzes model compression trade-offs (Accuracy vs. Efficiency).

This model simulates the ‘Compression Tax’ — the accuracy degradation that occurs when reducing model size via quantization or pruning, balanced against the gains in memory footprint and inference latency.
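The memory side of this trade-off follows from simple arithmetic: weight storage scales with the bits per parameter and with the fraction of weights kept. The sketch below illustrates that arithmetic only and is not this library's implementation. The helper `weight_bytes` and the 7-billion-parameter count are hypothetical, and the index or metadata overhead of sparse storage formats is ignored.

```python
def weight_bytes(num_params: int, bitwidth: int, sparsity: float = 0.0) -> float:
    """Idealized weight storage in bytes (hypothetical helper, not a library call)."""
    if bitwidth <= 0:
        raise ValueError("bitwidth must be positive")
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be in [0.0, 1.0]")
    kept_params = num_params * (1.0 - sparsity)  # weights remaining after pruning
    return kept_params * bitwidth / 8            # 8 bits per byte


NUM_PARAMS = 7_000_000_000  # illustrative model size, an assumption

baseline = weight_bytes(NUM_PARAMS, 16)                # 16-bit baseline
int8 = weight_bytes(NUM_PARAMS, 8)                     # 8-bit quantization
int4_half = weight_bytes(NUM_PARAMS, 4, sparsity=0.5)  # 4-bit plus 50% pruning

print(baseline / int8)       # 2.0
print(baseline / int4_half)  # 8.0
```

These ratios are upper bounds on memory savings; the accuracy cost of each setting is what the solver estimates and cannot be derived from this arithmetic.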

Literature Source:
1. Han et al. (2015), “Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding.”
2. Gholami et al. (2021), “A Survey of Quantization Methods for Efficient Neural Network Inference.”
3. Blalock et al. (2020), “What is the State of Neural Network Pruning?”

Methods

| Name | Description |
| --- | --- |
| solve | Solves for compression gains and estimated accuracy impact. |

solve

solvers.CompressionModel.solve(
    model,
    hardware,
    method='quantization',
    target_bitwidth=8,
    sparsity=0.0,
    sparsity_type='unstructured',
)

Solves for compression gains and estimated accuracy impact.

Parameters

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| model | Workload | The model to be compressed. | required |
| hardware | HardwareNode | The target execution hardware. | required |
| method | str | The compression method (‘quantization’, ‘pruning’, ‘distillation’). | 'quantization' |
| target_bitwidth | int | Target numerical precision in bits (e.g., 8 for INT8/FP8, 4 for INT4). At 8-bit, accuracy delta uses the FP8 estimate (near-lossless) by default. | 8 |
| sparsity | float | Target sparsity ratio (0.0 to 1.0) for pruning. | 0.0 |
| sparsity_type | str | Type of sparsity pattern: ‘unstructured’ (storage savings only, no inference speedup), ‘structured’ (both storage and compute savings), or ‘n_m’ (hardware 2:4 sparsity with 2x speedup at 50% sparsity; Ampere+). | 'unstructured' |
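The effect of `sparsity_type` on compute can be sketched as a small lookup. This is an illustration of the rules listed above, not the solver's actual formula. The function name `ideal_compute_speedup` is hypothetical, the 1 / (1 - sparsity) scaling for structured pruning is an idealized assumption, and treating ‘n_m’ as giving no speedup away from 50% sparsity is likewise an assumption.

```python
import math

VALID_SPARSITY_TYPES = ("unstructured", "structured", "n_m")


def ideal_compute_speedup(sparsity: float, sparsity_type: str = "unstructured") -> float:
    """Idealized inference speedup from pruning (hypothetical helper)."""
    if sparsity_type not in VALID_SPARSITY_TYPES:
        raise ValueError(f"sparsity_type must be one of {VALID_SPARSITY_TYPES}")
    # 1.0 is excluded here only to avoid dividing by zero below.
    if not 0.0 <= sparsity < 1.0:
        raise ValueError("sparsity must be in [0.0, 1.0)")

    if sparsity_type == "unstructured":
        # Documented behavior: storage savings only, no inference speedup.
        return 1.0
    if sparsity_type == "n_m":
        # Documented behavior: 2x at 50% sparsity (2:4 pattern).
        # ASSUMPTION: no speedup at any other ratio.
        return 2.0 if math.isclose(sparsity, 0.5) else 1.0
    # ASSUMPTION: structured pruning removes work in proportion to sparsity.
    return 1.0 / (1.0 - sparsity)


print(ideal_compute_speedup(0.9, "unstructured"))  # 1.0
print(ideal_compute_speedup(0.5, "n_m"))           # 2.0
print(ideal_compute_speedup(0.75, "structured"))   # 4.0
```

Real structured-pruning speedups are usually below this ideal because memory traffic and kernel overheads do not shrink in step with the arithmetic.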

Returns

| Name | Type | Description |
| --- | --- | --- |
| | CompressionResult | Compression metrics including memory savings, inference speedup, and estimated accuracy delta. |
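A common use of `solve` is to compare several bit widths for the same model and hardware. The sketch below relies only on the signature documented above. The helper `sweep_bitwidths` is hypothetical, and because this page does not list the fields of `CompressionResult` or show how a `Workload` or `HardwareNode` is constructed, the results are returned as-is and both objects are assumed to be built elsewhere.

```python
from typing import Any, Dict, Iterable


def sweep_bitwidths(
    solver: Any,
    model: Any,
    hardware: Any,
    bitwidths: Iterable[int] = (16, 8, 4),
) -> Dict[int, Any]:
    """Call solver.solve once per bit width and collect the results.

    `solver` is expected to expose solve(model, hardware, method=...,
    target_bitwidth=...) as documented for solvers.CompressionModel.
    """
    results: Dict[int, Any] = {}
    for bits in bitwidths:
        results[bits] = solver.solve(
            model,
            hardware,
            method="quantization",
            target_bitwidth=bits,
        )
    return results


# Intended use (my_model and my_hardware are placeholders for a Workload
# and a HardwareNode created elsewhere):
#   results = sweep_bitwidths(solvers.CompressionModel(), my_model, my_hardware)
```

Taking the solver as an argument keeps the helper independent of how the model and hardware objects are created.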