Apple Machine Learning Research has introduced a new benchmark comprising four distinct tasks and nine datasets. The study probes how well generative models handle specific contextual linguistic features. While LLMs appear capable on the surface, this framework measures their actual comprehension more rigorously, giving developers a baseline for identifying specific linguistic failures.