Apple Machine Learning Research introduces a benchmark of nine datasets spanning four tasks to probe how well language models handle linguistic context. Current evaluations of generative models often overlook specific contextual features. The study tests whether LLMs genuinely grasp contextual nuance or merely rely on surface patterns. Practitioners can use the benchmark to quantify where models fail to maintain coherent situational awareness.