A 7B model from ByteDance handles image-heavy documents four times longer than its training data. It replaces traditional transcription with a question-answering approach to locate key passages. This method proves more reliable than using larger models for long-form analysis. Practitioners can now achieve higher accuracy with smaller, more efficient model architectures.