A 7B model from ByteDance outperforms larger models on image-heavy documents by answering questions about them instead of transcribing their text. The approach holds up even on documents four times longer than those in the training data, and it improves retrieval reliability. Practitioners can therefore train smaller models to handle very long files without expensive full-page transcription.