A new technical guide demonstrates how to accelerate matrix multiplication in Swift from the Gflop/s range to the Tflop/s range. The author optimizes memory access patterns and uses Metal shaders to maximize hardware utilization. The approach offers a blueprint for developers building custom LLM training loops on Apple Silicon. It is a niche but useful performance deep dive.
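
To make the memory-access point concrete, here is a minimal CPU-only sketch in Swift. It is not taken from the guide: the function names (`matmulNaive`, `matmulReordered`, `gflops`), the matrix size, and the test data are all assumptions chosen for illustration. It shows how reordering the loops turns a strided inner loop into a contiguous one, and how throughput is usually reported, counting 2n³ floating-point operations for an n × n multiply.

```swift
import Foundation

/// Naive row-major multiply, C = A * B, for n x n matrices (hypothetical helper).
func matmulNaive(_ a: [Float], _ b: [Float], n: Int) -> [Float] {
    var c = [Float](repeating: 0, count: n * n)
    for i in 0..<n {
        for j in 0..<n {
            var sum: Float = 0
            for k in 0..<n {
                // b is read with stride n here, which is unfriendly to the cache.
                sum += a[i * n + k] * b[k * n + j]
            }
            c[i * n + j] = sum
        }
    }
    return c
}

/// Same arithmetic with loops ordered (i, k, j), so the inner loop
/// walks both b and c contiguously (hypothetical helper).
func matmulReordered(_ a: [Float], _ b: [Float], n: Int) -> [Float] {
    var c = [Float](repeating: 0, count: n * n)
    for i in 0..<n {
        for k in 0..<n {
            let aik = a[i * n + k]
            for j in 0..<n {
                c[i * n + j] += aik * b[k * n + j]
            }
        }
    }
    return c
}

/// Throughput in Gflop/s: one multiply and one add per inner step, so 2 * n^3 operations.
func gflops(n: Int, seconds: Double) -> Double {
    return 2.0 * Double(n) * Double(n) * Double(n) / seconds / 1e9
}

// Assumed problem size and small integer-valued inputs, so both variants agree exactly.
let n = 256
let a = (0..<(n * n)).map { Float($0 % 7) }
let b = (0..<(n * n)).map { Float($0 % 5) }

let start = Date()
let c = matmulReordered(a, b, n: n)
let elapsed = Date().timeIntervalSince(start)
print("reordered: \(gflops(n: n, seconds: elapsed)) Gflop/s")
```

A sketch like this stays in the Gflop/s range at best. Reaching Tflop/s, as the guide describes, requires moving the work to the GPU with Metal shaders, which this example does not attempt.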