OpenAI models occasionally emit incomprehensible snippets such as “parted disclaim marinade” while still arriving at correct answers. Researchers from Apollo Research and METR are investigating whether this behavior is load-bearing, i.e., whether the gibberish actually contributes to task performance. Current LLM-as-judge strategies fail to detect these patterns, and solving this is critical for auditing model transparency and alignment.
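As a toy illustration of why such spans can slip past naive filters (a sketch under assumptions, not the researchers' actual auditing method): each token in “parted disclaim marinade” is a real English word, so any word-level validity check passes even though the span is semantically incoherent. `DICTIONARY` below is a tiny hypothetical stand-in for a real word list.

```python
# Hypothetical stand-in for a full English word list (assumption
# for illustration only).
DICTIONARY = {
    "parted", "disclaim", "marinade",
    "the", "answer", "is", "correct",
}

def all_valid_words(snippet: str) -> bool:
    """True if every whitespace-separated token is a dictionary word."""
    return all(tok in DICTIONARY for tok in snippet.lower().split())

# The incoherent snippet passes a word-validity filter, while random
# character noise does not -- detecting it requires judging meaning,
# not just token validity.
print(all_valid_words("parted disclaim marinade"))  # True
print(all_valid_words("xqzzy blorf"))               # False
```

Because the failure is semantic rather than lexical, a detector has to reason about coherence in context, which is exactly where current LLM-as-judge setups reportedly fall short.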