Distilling PEER models to int4 requires only a second int4 tensor to act as the gradient accumulator, in place of the full-precision master weights that low-precision training normally keeps. Stochastic rounding is what makes this work: under round-to-nearest, any update smaller than half a quantization step is rounded away, but rounding up or down with probability proportional to proximity applies the update correctly in expectation, maintaining performance while drastically reducing VRAM usage. Because common frameworks lack a native int4 storage dtype, two 4-bit values are packed into each byte of an 8-bit tensor, so twice as many expert parameters fit per gigabyte. Together, these pieces lower the hardware bar for interpretability-by-design work with PEER-style models; sketches of both steps follow.
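To make the rounding step concrete, here is a minimal PyTorch sketch of one SGD update applied directly to an int4 accumulator. It assumes a single per-tensor `scale` and plain SGD; the function names are illustrative, not taken from any PEER implementation.

```python
import torch

def stochastic_round(x: torch.Tensor) -> torch.Tensor:
    """Round up or down with probability proportional to proximity,
    so the result is unbiased: E[stochastic_round(x)] == x."""
    floor = torch.floor(x)
    frac = x - floor                                  # fractional part in [0, 1)
    return floor + (torch.rand_like(x) < frac).float()

def sgd_step_int4(acc_q: torch.Tensor, grad: torch.Tensor,
                  lr: float, scale: float) -> torch.Tensor:
    """One SGD step on an int4 accumulator (values on the grid [-8, 7],
    stored in an int8 container). `scale` maps grid units to real values;
    both names are assumptions for this sketch."""
    acc = acc_q.float() - (lr / scale) * grad         # update in grid units
    return stochastic_round(acc).clamp_(-8, 7).to(torch.int8)

# Hypothetical usage: a gradient far below one grid step still moves
# the accumulator, in expectation, rather than rounding to nothing.
w_q = torch.zeros(4, dtype=torch.int8)                # weights start at 0 on the grid
g = torch.full((4,), 0.01)
w_q = sgd_step_int4(w_q, g, lr=0.1, scale=0.05)
```

Each individual update smaller than `scale / 2` would vanish under round-to-nearest; with stochastic rounding, the accumulator still drifts by the true update on average over many steps.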
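The packing itself is plain bit manipulation. Below is a sketch, again in PyTorch with illustrative names, that stores two signed int4 values per byte and recovers them losslessly:

```python
import torch

def pack_int4(vals: torch.Tensor) -> torch.Tensor:
    """Pack pairs of int4 values (range [-8, 7]) into bytes, two per byte.
    The container here is uint8; an int8 tensor holds the same bits."""
    assert vals.numel() % 2 == 0, "pad to an even count before packing"
    nibbles = (vals.to(torch.int16) & 0xF).reshape(-1, 2)  # two's-complement nibbles
    return (nibbles[:, 0] | (nibbles[:, 1] << 4)).to(torch.uint8)

def unpack_int4(packed: torch.Tensor) -> torch.Tensor:
    """Inverse of pack_int4: split each byte back into two signed int4 values."""
    lo = (packed & 0xF).to(torch.int8)
    hi = ((packed >> 4) & 0xF).to(torch.int8)
    vals = torch.stack([lo, hi], dim=1).reshape(-1)
    # sign-extend: nibbles >= 8 encode negative numbers in two's complement
    return torch.where(vals >= 8, vals - 16, vals)

# Round trip is exact across the full int4 range.
vals = torch.tensor([-8, -1, 0, 7], dtype=torch.int8)
assert torch.equal(unpack_int4(pack_int4(vals)), vals)
```

Halving the bytes per parameter is exactly what doubles the number of experts that fit in a gigabyte of VRAM.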