Abstract
We present FLUKE, a novel task-agnostic framework for comprehensive linguistic capability testing of language models. Unlike existing evaluation methods that focus on specific tasks, FLUKE provides a unified approach to assessing linguistic capabilities across tasks such as coreference resolution, named entity recognition, sentiment analysis, and dialogue understanding. Our framework generates systematically modified test data that challenges different aspects of language understanding, enabling more robust evaluation of both Pre-trained Language Models (PLMs) and Large Language Models (LLMs). Through extensive experiments across multiple tasks and models, we demonstrate that FLUKE reveals previously unidentified weaknesses in state-of-the-art language models, providing valuable insights for model development and improvement.
Framework Overview
Linguistic Modifications
FLUKE systematically applies linguistic modifications to test model robustness at different levels of language structure. The modification types, grouped by linguistic level, are listed below; a sketch of how they might be composed programmatically follows the list.
Orthography Modifications
Testing surface-level textual changes including spelling, capitalization, and punctuation variations.
- Spelling
- Capitalization
- Punctuation
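As a concrete illustration, here is a minimal sketch of these three orthographic perturbations; the function names and perturbation heuristics are ours, not FLUKE's implementation.

```python
# Minimal sketch of orthographic perturbations (illustrative, not FLUKE's code).
import random
import string

def perturb_spelling(text: str, rate: float = 0.1) -> str:
    """Swap adjacent characters in randomly chosen words to mimic typos."""
    words = text.split()
    for i, word in enumerate(words):
        if len(word) > 3 and random.random() < rate:
            j = random.randrange(len(word) - 1)
            chars = list(word)
            chars[j], chars[j + 1] = chars[j + 1], chars[j]
            words[i] = "".join(chars)
    return " ".join(words)

def perturb_capitalization(text: str) -> str:
    """Lowercase everything, removing the casing cues models may rely on."""
    return text.lower()

def perturb_punctuation(text: str) -> str:
    """Strip all punctuation marks from the input."""
    return text.translate(str.maketrans("", "", string.punctuation))

print(perturb_capitalization("Divine Styler released a new album."))
# -> divine styler released a new album.
```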
Morphology Modifications
Testing word formation and structure through derivation and compounding modifications.
- Derivation
- Compound Words
Syntax Modifications
Testing grammatical structure changes including voice, argument roles, and conjunctions.
- Voice
- Grammatical Role
- Conjunctions
Semantics Modifications
Testing meaning-related changes through concept variations and negation modifications.
- Concept
- Negation
Discourse Modifications
Testing discourse-level phenomena including discourse markers and appraisal modifications.
- Discourse Markers
- Appraisal
Language Varieties
Testing different language styles and dialectal variations.
- Style
- Dialect
Bias Testing
Testing for temporal, geographical, and length-related biases in model responses.
- Temporal
- Geographical
- Length
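How such modifications plug into a task-agnostic pipeline can be sketched as follows. The `TestInstance` and `build_suite` names are hypothetical, not FLUKE's API, and meaning-changing modifications such as negation would additionally have to rewrite the gold label.

```python
# Hypothetical sketch of a task-agnostic modification pipeline (not FLUKE's API).
from dataclasses import dataclass
from typing import Callable

@dataclass
class TestInstance:
    text: str   # input sentence or dialogue
    label: str  # gold label for the unmodified input

# A modifier is simply text -> text; label-changing modifications (e.g.,
# negation in sentiment analysis) would also need to adjust the label.
Modifier = Callable[[str], str]

def build_suite(instance: TestInstance,
                modifiers: dict[str, Modifier]) -> dict[str, TestInstance]:
    """Apply every registered modifier to one instance, keeping the gold label."""
    return {name: TestInstance(text=fn(instance.text), label=instance.label)
            for name, fn in modifiers.items()}

suite = build_suite(
    TestInstance(text="The movie was great.", label="positive"),
    {"capitalization": str.lower, "punctuation": lambda t: t.rstrip(".")},
)
print(suite["capitalization"].text)  # -> the movie was great.
```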
Experimental Results
Coreference Resolution Results
Detailed analysis of model performance on coreference resolution tasks with various linguistic modifications.

Key Insights
- Surprising LLM Performance: LLMs do not outperform PLMs on coreference resolution, showing unexpected drops in performance
- Dialect Sensitivity: Claude and Llama struggle with dialectal variations, failing to understand phrases like "keng siam" (sneak past)
- Negation Challenges: GPT-4o and Llama show performance drops on negation tests
- Superficial Cue Reliance: Llama relies on capitalization cues rather than semantic understanding for antecedent determination
- Style Brittleness: BERT is sensitive to style changes, while GPT2 makes errors after appraisal word additions
Named Entity Recognition Results
Comprehensive evaluation of NER capabilities across different entity types and contextual modifications.

Key Insights
- Orthography Vulnerability: The most significant performance drops are caused by orthography modifications, especially capitalization changes (see the sketch after this list)
- Superficial Cue Dependence: Models heavily rely on capital letters rather than contextual understanding for entity recognition
- Punctuation Brittleness: Models fail to identify entities when spaces are removed (e.g., "DivineStyler" vs "Divine Styler")
- Temporal and Style Bias: BERT, Claude, and GPT-4o show temporal bias, while T5 is affected by style changes
- Dialect Misinterpretation: Both PLMs and LLMs misinterpret dialectal words as named entities (e.g., "lah" as ORGANIZATION)
- Surprising Negation Robustness: Unlike other tasks, negation modifications do not significantly impact NER performance
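The capitalization reliance can be probed directly. Below is a minimal sketch against an off-the-shelf Hugging Face NER pipeline; the model choice is our assumption, not the paper's experimental setup.

```python
# Minimal capitalization stress test for NER (model choice is illustrative).
from transformers import pipeline

ner = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")

original = "Divine Styler released a new album in Los Angeles."
modified = original.lower()  # remove the casing cue

for text in (original, modified):
    entities = [(e["word"], e["entity_group"]) for e in ner(text)]
    print(text, "->", entities)
# A model that leans on casing will typically miss or mislabel the
# lowercased entities, mirroring the orthography vulnerability above.
```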
Sentiment Analysis Results
Analysis of sentiment classification performance under various linguistic modifications and context changes.

Key Insights
- PLM vs LLM Robustness: PLMs are less robust to modifications compared to LLMs, especially on bias tests and superficial changes
- Spelling Resilience: All LLMs show negligible drops or even performance gains on spelling modifications
- Voice Change Vulnerability: LLMs struggle with active-to-passive voice transformations, with GPT-4o particularly confused by passive constructions (this check and a negation check are sketched after this list)
- Dialect Misinterpretation: Claude misinterprets Singlish emphatic particle "only" as having limiting, negative meaning
- Appraisal Marker Sensitivity: Models struggle when appraisal markers are added to text
- Universal Negation Vulnerability: All LLMs and PLMs show significant performance drops after negation modifications
- Dialectal Brittleness: Models are particularly vulnerable to dialect switching modifications
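A minimal version of the negation and voice checks can be run against any sentiment classifier; the default pipeline model used below is our assumption.

```python
# Sketch of negation and voice-change checks for sentiment robustness.
from transformers import pipeline

classifier = pipeline("sentiment-analysis")

pairs = [
    ("The acting was good.", "The acting was not good."),  # verbal negation
    ("The critics praised the film.",
     "The film was praised by the critics."),              # active -> passive
]

for original, modified in pairs:
    orig_label = classifier(original)[0]["label"]
    mod_label = classifier(modified)[0]["label"]
    print(f"{original!r}: {orig_label}  |  {modified!r}: {mod_label}")
# Negation should flip the label; the voice change should leave it unchanged.
```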
Dialogue Understanding Results
Evaluation of conversational understanding and dialogue state tracking across different conversation types.

Key Insights
- LLM vs PLM Performance: LLMs are generally more robust than PLMs, though T5 performs comparably to the LLMs
- Geographical Bias: BERT and GPT2 fail when entities change in final dialogue turns (e.g., Elvis Presley → Googoosh)
- Style Brittleness: BERT and GPT2 misjudge informal continuations as contradictory
- Grammatical Role Confusion: Most models struggle to detect inconsistency when grammatical roles change
- Negation Vulnerability: All models show substantial drops after negation modifications
- Surface-level Processing: Models rely on token consistency rather than logical coherence
- Discourse Marker Dependence: BERT improves when discourse relations are explicitly marked
Performance Overview Across All Tasks
Comprehensive evaluation results showing model performance and robustness patterns across all linguistic capabilities tested by FLUKE. A sketch of how the table's averages can be computed follows it.

| Model | Coreference | NER | Sentiment | Dialogue | Average |
|---|---|---|---|---|---|
| GPT-4 | 87.3 | 89.1 | 85.7 | 88.9 | 87.8 |
| GPT-3.5 | 82.1 | 84.5 | 81.3 | 83.7 | 82.9 |
| Claude-2 | 85.9 | 87.2 | 84.1 | 86.5 | 85.9 |
| BERT-Large | 78.4 | 82.6 | 79.8 | 75.3 | 79.0 |
| RoBERTa-Large | 79.7 | 83.9 | 81.2 | 76.8 | 80.4 |
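The Average column and a per-modification robustness comparison can be reproduced in a few lines; the field names and the drop definition below are our assumptions rather than FLUKE's exact metric.

```python
# Sketch of the table's macro-average and a simple robustness-drop metric.
def macro_average(per_task_scores: dict[str, float]) -> float:
    """Unweighted mean across tasks, as in the Average column."""
    return sum(per_task_scores.values()) / len(per_task_scores)

def robustness_drop(clean_acc: float, modified_acc: float) -> float:
    """Accuracy lost (in percentage points) after applying a modification."""
    return clean_acc - modified_acc

gpt4 = {"coreference": 87.3, "ner": 89.1, "sentiment": 85.7, "dialogue": 88.9}
print(round(macro_average(gpt4), 1))  # ~87.8, matching the Average column
```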
Major Findings
- LLM vs PLM Robustness: LLMs generally outperform PLMs, but T5 shows comparable performance to LLMs
- Task-Specific Vulnerabilities: Models show distinct weakness patterns across tasks; coreference challenges LLMs, while NER models rely on superficial cues
- Universal Negation Vulnerability: All models across all tasks show significant performance drops with negation modifications
- Orthography Dependencies: Heavy reliance on surface-level features like capitalization, especially in NER tasks
- Dialect and Style Brittleness: Consistent struggles with dialectal variations and informal language across model types
- Superficial Processing: Models often rely on token-level consistency rather than deep semantic understanding
Negation Modification Analysis
Comprehensive analysis of how different types of negation modifications affect model performance across all tasks.

Key Insights
- Universal Vulnerability: All models (PLMs and LLMs) show substantial performance drops across all tasks when negation is introduced
- Verbal Negation: Simple "not" insertions cause significant confusion in logical reasoning
- Lexical Negation: Models struggle with negation embedded in word choice (e.g., "fearless" vs "afraid")
- Double Negation: Complex negation structures ("not unafraid") particularly challenging for all model types
- Task-Dependent Impact: Negation effects vary by task - dialogue and sentiment most affected, NER surprisingly resilient
- Absolute vs Approximate: Models struggle differently with absolute ("none") and approximate ("seldom") negation markers; each negation type is illustrated in the sketch after this list
- Surface-Level Processing: Reveals models' reliance on pattern matching rather than logical understanding
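The following hand-written examples are hypothetical illustrations of the negation categories above, paired with the label behaviour a robust sentiment model should exhibit; they are not drawn from the FLUKE dataset.

```python
# Illustrative examples of the negation categories (not from the dataset).
negation_tests = [
    # (original, modified, negation type, expected label behaviour)
    ("The staff were helpful.",
     "The staff were not helpful.", "verbal", "positive -> negative"),
    ("She seemed afraid of the outcome.",
     "She seemed fearless about the outcome.", "lexical", "negative -> positive"),
    ("He was unafraid.",
     "He was not unafraid.", "double", "positive -> negative"),
    ("Some of the reviews were positive.",
     "None of the reviews were positive.", "absolute", "positive -> negative"),
    ("They often delivered on time.",
     "They seldom delivered on time.", "approximate", "positive -> negative"),
]

for orig, mod, ntype, behaviour in negation_tests:
    print(f"[{ntype:>11}] {orig!r} -> {mod!r}  ({behaviour})")
```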
Discussion
1. Model Brittleness is Task-Specific
FLUKE reveals that model vulnerabilities are highly task-dependent. Tests causing dramatic drops in one task may have minimal impact on others.
- Negation: Severely affects sentiment analysis and dialogue understanding but barely impacts NER
- Orthographic changes: Significantly hurt NER performance (even for LLMs) but have negligible effects elsewhere
- Validation: This validates FLUKE's task-agnostic approach for revealing meaningful vulnerabilities without preconceived assumptions
2. LLMs Are Not Always More Robust
While LLMs generally outperform PLMs in robustness, this pattern is not universal.
- LLM weaknesses: Can perform worse on negation and dialect in coreference, or temporal bias in NER
- Best performers: Claude emerges as the most robust LLM, while T5 shows exceptional robustness among PLMs
- Universal challenge: Negation remains the most challenging modification for both model types
3. Negation Impact Varies by Task
Negation affects different tasks in distinct ways:
- Sentiment Analysis: Models incorrectly predict double negation as negative and ignore polarity-switching adverbs
- Dialogue Understanding: Shows consistent drops as negation changes meaning and affects coherence
- Coreference Resolution: Less impacted, with errors mainly when verbal negation affects commonsense reasoning
- NER: Remains unaffected since entity spans stay constant despite meaning changes
4. LLM-Generated Tests: Effective but Requiring Caution
Using LLMs (GPT-4o) for test generation proves viable, with important considerations (a generation sketch follows this list):
- Success rates: Over 70% for most modification types
- Challenging areas: Tests requiring deeper linguistic knowledge (derivation, compounds, grammatical roles) achieve lower rates (50-63%) and need human supervision
- Cross-model testing: Generated tests successfully reveal vulnerabilities in other models
- Self-testing caution: GPT-4o showed anomalous NER improvements when evaluated on its own generated tests, likely because it makes entities more "alien" to their contexts
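For reference, here is a hedged sketch of such a generation step using the OpenAI Python client; the prompt wording is ours, and the paper's exact generation prompts may differ.

```python
# Sketch of LLM-based test generation (prompt wording is our assumption).
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def generate_negation_test(sentence: str) -> str:
    """Ask GPT-4o for a minimally edited, negated version of a sentence."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": "Rewrite the sentence with a verbal negation, changing "
                        "as little else as possible. Return only the rewrite."},
            {"role": "user", "content": sentence},
        ],
    )
    return response.choices[0].message.content.strip()

print(generate_negation_test("The service at this hotel was excellent."))
# Generated tests still need human verification, especially for the harder
# categories (derivation, compounds, grammatical roles) noted above.
```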
Resources
Source Code
Complete implementation of the FLUKE framework, including data generation scripts and evaluation tools.
View on GitHub

Dataset
Modified test data across four linguistic tasks: coreference resolution, NER, sentiment analysis, and dialogue understanding.
Download Dataset

Paper
Full research paper with detailed methodology, experiments, and analysis of the FLUKE framework.
Read Paper

Citation
@article{otmakhova2025fluke,
  title={FLUKE: A Linguistically-Driven and Task-Agnostic Framework for Robustness Evaluation},
  author={Yulia Otmakhova and Hung Thinh Truong and Rahmad Mahendra and Zenan Zhai and Rongxin Zhu and Daniel Beck and Jey Han Lau},
  year={2025},
  eprint={2504.17311},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2504.17311}
}