solvers.CompressionModel

solvers.CompressionModel()

Analyzes model compression trade-offs (Accuracy vs. Efficiency).

This model simulates the ‘Compression Tax’: the accuracy degradation that occurs when reducing model size via quantization or pruning, weighed against the resulting reductions in memory footprint and inference latency.

Literature Sources:
1. Han et al. (2015), “Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding.”
2. Gholami et al. (2021), “A Survey of Quantization Methods for Efficient Neural Network Inference.”
3. Blalock et al. (2020), “What is the State of Neural Network Pruning?”

Methods

Name	Description
solve	Solves for compression gains and estimated accuracy impact.

solve

solvers.CompressionModel.solve(
    model,
    hardware,
    method='quantization',
    target_bitwidth=8,
    sparsity=0.0,
    sparsity_type='unstructured',
)

Solves for compression gains and estimated accuracy impact.

Parameters

Name	Type	Description	Default
model	Workload	The model to be compressed.	required
hardware	HardwareNode	The target execution hardware.	required
method	str	The compression method (‘quantization’, ‘pruning’, ‘distillation’).	`'quantization'`
target_bitwidth	int	Target numerical precision in bits (e.g., 8 for INT8/FP8, 4 for INT4). At 8 bits, the accuracy delta uses the FP8 estimate (near-lossless) by default.	`8`
sparsity	float	Target sparsity ratio (0.0 to 1.0) for pruning.	`0.0`
sparsity_type	str	Type of sparsity pattern: ‘unstructured’ (storage savings only, no inference speedup), ‘structured’ (both storage and compute savings), or ‘n_m’ (hardware 2:4 sparsity with a 2x speedup at 50% sparsity; NVIDIA Ampere and later).	`'unstructured'`
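
To make the interaction between target_bitwidth, sparsity, and sparsity_type concrete, the following is a minimal sketch of the kind of first-order arithmetic these parameters suggest. It is not the library's implementation: the function names `sketch_weight_bytes` and `sketch_speedup` are hypothetical, the storage estimate ignores sparse-index metadata, and the 1 / (1 - sparsity) speedup for structured pruning is an idealized assumption. Only the "no speedup" rule for unstructured sparsity and the 2x figure for 2:4 sparsity at 50% come from the parameter table.

```python
def sketch_weight_bytes(num_params, target_bitwidth=8, sparsity=0.0):
    """Idealized weight storage in bytes.

    Assumption: only non-zero weights are stored and the index/metadata
    overhead of a sparse format is ignored.
    """
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be in [0.0, 1.0]")
    return num_params * (1.0 - sparsity) * target_bitwidth / 8


def sketch_speedup(sparsity=0.0, sparsity_type="unstructured"):
    """Idealized compute speedup for each documented sparsity pattern."""
    if not 0.0 <= sparsity < 1.0:
        raise ValueError("sparsity must be in [0.0, 1.0)")
    if sparsity_type == "unstructured":
        # Documented: storage savings only, no inference speedup.
        return 1.0
    if sparsity_type == "structured":
        # Assumption: compute scales with the fraction of weights kept.
        return 1.0 / (1.0 - sparsity)
    if sparsity_type == "n_m":
        # Documented: 2:4 sparsity gives 2x at 50% sparsity.
        # This sketch models only that case.
        if sparsity != 0.5:
            raise ValueError("this sketch models 2:4 sparsity at 0.5 only")
        return 2.0
    raise ValueError(f"unknown sparsity_type: {sparsity_type!r}")
```

Under these assumptions, a 1,000,000-parameter model at 4 bits and 50% sparsity needs 250,000 bytes for its weights, and the same sparsity yields no speedup when unstructured but 2x when applied as 2:4.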

Returns

Name	Type	Description
	CompressionResult	Compression metrics including memory savings, inference speedup, and estimated accuracy delta.
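
A common use of solve is to compare several precisions for the same model and hardware. The helper below is a hedged sketch of that pattern: `sweep_bitwidths` is a hypothetical name, and the solver, model, and hardware objects are passed in because this page does not show how a Workload or HardwareNode is constructed. It relies only on the solve signature documented above and returns the CompressionResult objects untouched, since their field names are not listed here.

```python
def sweep_bitwidths(solver, model, hardware, bitwidths=(8, 4)):
    """Call solver.solve() once per bitwidth and collect results by bitwidth.

    `solver` is expected to expose the documented
    solve(model, hardware, method=..., target_bitwidth=...) method.
    """
    results = {}
    for bits in bitwidths:
        results[bits] = solver.solve(
            model,
            hardware,
            method="quantization",
            target_bitwidth=bits,
        )
    return results
```

With the real solver this might be invoked as `sweep_bitwidths(solvers.CompressionModel(), model, hardware)`, where `model` and `hardware` are obtained elsewhere.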