Apple Machine Learning Research has introduced a new benchmark comprising four distinct tasks and nine datasets. The study probes whether generative models genuinely grasp contextual linguistic features or merely mimic surface patterns, filling a gap in standard NLP evaluation. Practitioners can now better quantify how well LLMs handle complex, context-dependent nuances in human language.