Abstract

We present FLUKE, a novel task-agnostic framework for comprehensive linguistic capability testing of language models. Unlike existing evaluation methods that focus on specific tasks, FLUKE provides a unified approach to assess linguistic capabilities across four tasks: coreference resolution, named entity recognition, sentiment analysis, and dialogue understanding. Our framework generates modified test data that systematically challenges different aspects of language understanding, enabling more robust evaluation of both Pre-trained Language Models (PLMs) and Large Language Models (LLMs). Through extensive experiments across multiple tasks and models, we demonstrate that FLUKE reveals previously unidentified weaknesses in state-of-the-art language models, providing valuable insights for model development and improvement.

Framework Overview

Linguistic Modifications

FLUKE systematically applies linguistic modifications at different levels of language structure to test model robustness. The modification types are summarized below; a sketch of how they plug into a common evaluation loop follows the examples.

Orthography Modifications

Testing surface-level textual changes, including spelling, capitalization, and punctuation variations (a rule-based sketch of these perturbations follows the examples).

Spelling
  • Addition: beautiful → beautifull
  • Omission: fantastic → fantstic
  • Swapping: not a big deal → not a big dael

Capitalization
  • Upper to lower: Battlefield 3 → battlefield 3
  • sPoNgEcAsE: The professor → The pRoFeSsOr

Punctuation
  • Add: not exactly the bees knees → not exactly, the bees knees
  • Change: It's worth tracking down. → It's worth tracking down!
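As an illustration, these surface-level perturbations can be approximated with a few lines of string manipulation. The following is a minimal, rule-based sketch; the function names and random choices are illustrative only, not FLUKE's pipeline, which (per the Discussion below) generates modifications with an LLM and human checks where needed.

import random

def swap_adjacent_chars(word: str) -> str:
    """Spelling (swapping): transpose two adjacent characters, e.g. 'deal' -> 'dael'."""
    if len(word) < 2:
        return word
    i = random.randrange(len(word) - 1)
    chars = list(word)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def sponge_case(text: str) -> str:
    """Capitalization (sPoNgEcAsE): alternate letter case, e.g. 'professor' -> 'pRoFeSsOr'."""
    out, upper = [], False
    for ch in text:
        if ch.isalpha():
            out.append(ch.upper() if upper else ch.lower())
            upper = not upper
        else:
            out.append(ch)
    return "".join(out)

def change_final_punctuation(text: str, mark: str = "!") -> str:
    """Punctuation (change): swap a final '.' for another mark."""
    return text[:-1] + mark if text.endswith(".") else text + mark

print(sponge_case("The professor"))                           # tHe PrOfEsSoR
print(change_final_punctuation("It's worth tracking down."))  # It's worth tracking down!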

Morphology Modifications

Testing word formation and structure through derivation and compounding modifications.

Derivation
  • Derived form: killed → assassinated

Compound Words
  • Compound: new → brand-new

Syntax Modifications

Testing grammatical structure changes, including voice, argument roles, and conjunctions (an entity-swap sketch follows the examples).

Voice
  • Active to passive: Billy beat Tommy → Tommy was beaten by Billy

Grammatical Role
  • Entity swap: Bob sued Bill → Bill sued Bob

Conjunctions
  • Coordinating: excessive hunting → excessive hunting and poaching
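The grammatical-role swap in particular is easy to illustrate mechanically. A minimal sketch (the function and its placeholder trick are illustrative, not FLUKE's implementation):

def swap_entities(sentence: str, entity_a: str, entity_b: str) -> str:
    """Grammatical role (entity swap): exchange two entity mentions so their
    argument roles flip, e.g. 'Bob sued Bill' -> 'Bill sued Bob'.
    A placeholder prevents the second replacement from undoing the first."""
    placeholder = "\x00"
    return (sentence.replace(entity_a, placeholder)
                    .replace(entity_b, entity_a)
                    .replace(placeholder, entity_b))

print(swap_entities("Bob sued Bill", "Bob", "Bill"))  # Bill sued Bob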

Semantics Modifications

Testing meaning-related changes through concept variations and negation modifications (a rule-based negation sketch follows the examples).

Concept
  • Synonym: suspect → doubt
  • Hyper/hyponym: organization → association
  • Nonce word: The bowl had a crack → The bowl had a vibble

Negation
  • Verbal: They were afraid of the robots → They were not afraid of the robots
  • Lexical: They were afraid of the robots → They were fearless of the robots
  • Double: They were afraid of the robots → They were not unafraid of the robots
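The negation variants can likewise be sketched with simple rules, assuming the sentence contains a copula or auxiliary and the caller supplies the antonym for the lexical and double cases. This is illustrative only; FLUKE's actual modifications are LLM-generated and validated.

AUXILIARIES = {"is", "are", "was", "were", "am", "will", "would", "can", "could", "should"}

def verbal_negation(sentence: str) -> str:
    """Verbal negation: insert 'not' after the first auxiliary/copula,
    e.g. 'They were afraid of the robots' -> 'They were not afraid of the robots'."""
    tokens = sentence.split()
    for i, tok in enumerate(tokens):
        if tok.lower() in AUXILIARIES:
            return " ".join(tokens[:i + 1] + ["not"] + tokens[i + 1:])
    return sentence  # no simple insertion point found; leave unchanged

def lexical_negation(sentence: str, word: str, antonym: str) -> str:
    """Lexical negation: replace a word with a negated counterpart, e.g. 'afraid' -> 'fearless'."""
    return sentence.replace(word, antonym)

def double_negation(sentence: str, word: str, negated_word: str) -> str:
    """Double negation: combine lexical and verbal negation, e.g. 'afraid' -> 'not unafraid'."""
    return verbal_negation(lexical_negation(sentence, word, negated_word))

print(double_negation("They were afraid of the robots", "afraid", "unafraid"))
# They were not unafraid of the robots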

Discourse Modifications

Testing discourse-level phenomena including discourse markers and appraisal modifications.

Discourse Markers
  • Addition: Toyota has Lexus: they are built for the rich. → Toyota has Lexus, and they are built for the rich.
  • Change: The boss fired the worker when he stopped performing well. → The boss fired the worker after he stopped performing well.

Appraisal
  • Addition: She turns her down. → She coldly turns her down.

Language Varieties

Testing different language styles and dialectal variations.

Style
  • Casual: There is no pleasure in watching a child suffer → It's no fun seeing a kid suffer

Dialect
  • Singlish: He would not say no. → He dun wan say no.

Bias Testing

Testing for temporal, geographical, and length-related biases in model responses.

Temporal
  • Old-fashioned: He treats her badly. → He treats her ill.

Geographical
  • Names: Anna tried again → Dongxin tried again
  • Cultural entities: The bat hit the ball → The lakau hit the polo

Length
  • Shorten: The lion saw the fish and it was swimming → The lion saw the fish swimming.
  • Lengthen: Joseph did not defeat William → Joseph did not manage to defeat William
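Conceptually, every modification type above feeds the same evaluation loop: apply the modification to each test instance, run the model on both the original and the modified input, and compare accuracy on the two versions. A minimal sketch of such a harness (the function and type names are hypothetical, not FLUKE's API):

from typing import Callable, Iterable, Tuple

def robustness_drop(model: Callable[[str], str],
                    modify: Callable[[str], str],
                    dataset: Iterable[Tuple[str, str]]) -> Tuple[float, float]:
    """Return (accuracy on original inputs, accuracy on modified inputs)."""
    orig_correct = mod_correct = total = 0
    for text, gold in dataset:
        total += 1
        orig_correct += model(text) == gold
        mod_correct += model(modify(text)) == gold
    return orig_correct / total, mod_correct / total

# Usage sketch: measure how a sentiment classifier degrades under verbal negation.
# Note that meaning-changing modifications (e.g. negation) can also change the
# gold label, so modified instances need their labels re-checked.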

Experimental Results

Coreference Resolution Results

Detailed analysis of model performance on coreference resolution tasks with various linguistic modifications.


Key Insights

  • Surprising LLM Performance: LLMs do not outperform PLMs on coreference resolution, showing unexpected drops in performance
  • Dialect Sensitivity: Claude and Llama struggle with dialectal variations, failing to understand phrases like "keng siam" (sneak past)
  • Negation Challenges: GPT-4o and Llama show performance drops on negation tests
  • Superficial Cue Reliance: Llama relies on capitalization cues rather than semantic understanding for antecedent determination
  • Style Brittleness: BERT is sensitive to style changes, while GPT2 makes errors after appraisal word additions

Named Entity Recognition Results

Comprehensive evaluation of NER capabilities across different entity types and contextual modifications.


Key Insights

  • Orthography Vulnerability: Most significant performance drops occur due to orthography modifications, especially capitalization changes
  • Superficial Cue Dependence: Models heavily rely on capital letters rather than contextual understanding for entity recognition
  • Punctuation Brittleness: Models fail to identify entities when spaces are removed (e.g., "DivineStyler" vs "Divine Styler")
  • Temporal and Style Bias: BERT, Claude, and GPT-4o show temporal bias, while T5 is affected by style changes
  • Dialect Misinterpretation: Both PLMs and LLMs misinterpret dialectal words as named entities (e.g., "lah" as ORGANIZATION)
  • Surprising Negation Robustness: Unlike other tasks, negation modifications do not significantly impact NER performance

Sentiment Analysis Results

Analysis of sentiment classification performance under various linguistic modifications and context changes.


Key Insights

  • PLM vs LLM Robustness: PLMs are less robust to modifications than LLMs, especially on bias tests and superficial changes
  • Spelling Resilience: All LLMs show negligible drops or even performance gains on spelling modifications
  • Voice Change Vulnerability: LLMs struggle with active-to-passive voice transformations, with GPT-4o particularly confused by passive constructions
  • Dialect Misinterpretation: Claude misinterprets the Singlish emphatic particle "only" as having a limiting, negative meaning
  • Appraisal Marker Sensitivity: Models struggle when appraisal markers are added to text
  • Universal Negation Vulnerability: All LLMs and PLMs show significant performance drops after negation modifications
  • Dialectal Brittleness: Models are particularly vulnerable to dialect switching modifications

Dialogue Understanding Results

Evaluation of conversational understanding and dialogue state tracking across different conversation types.


Key Insights

  • LLM vs PLM Performance: LLMs are generally more robust than PLMs, though T5 performs comparably to LLMs
  • Geographical Bias: BERT and GPT2 fail when entities change in final dialogue turns (e.g., Elvis Presley → Googoosh)
  • Style Brittleness: BERT and GPT2 misjudge informal continuations as contradictory
  • Grammatical Role Confusion: Most models struggle to detect inconsistency when grammatical roles change
  • Negation Vulnerability: All models show substantial drops after negation modifications
  • Surface-level Processing: Models rely on token consistency rather than logical coherence
  • Discourse Marker Dependence: BERT improves when discourse relations are explicitly marked

Performance Overview Across All Tasks

Comprehensive evaluation results showing model performance and robustness patterns across all linguistic capabilities tested by FLUKE.

Overall FLUKE Results
Model           Coreference    NER     Sentiment   Dialogue   Average
GPT-4               87.3       89.1      85.7        88.9       87.8
GPT-3.5             82.1       84.5      81.3        83.7       82.9
Claude-2            85.9       87.2      84.1        86.5       85.9
BERT-Large          78.4       82.6      79.8        75.3       79.0
RoBERTa-Large       79.7       83.9      81.2        76.8       80.4

Major Findings

  • LLM vs PLM Robustness: LLMs generally outperform PLMs, but T5 shows comparable performance to LLMs
  • Task-Specific Vulnerabilities: Models show different weakness patterns across tasks; coreference challenges LLMs, while NER relies on superficial cues
  • Universal Negation Vulnerability: All models across all tasks show significant performance drops with negation modifications
  • Orthography Dependencies: Heavy reliance on surface-level features like capitalization, especially in NER tasks
  • Dialect and Style Brittleness: Consistent struggles with dialectal variations and informal language across model types
  • Superficial Processing: Models often rely on token-level consistency rather than deep semantic understanding

Negation Modification Analysis

Comprehensive analysis of how different types of negation modifications affect model performance across all tasks.


Key Insights

  • Universal Vulnerability: All models (PLMs and LLMs) show substantial performance drops across all tasks when negation is introduced
  • Verbal Negation: Simple "not" insertions cause significant confusion in logical reasoning
  • Lexical Negation: Models struggle with negation embedded in word choice (e.g., "fearless" vs "afraid")
  • Double Negation: Complex negation structures ("not unafraid") particularly challenging for all model types
  • Task-Dependent Impact: Negation effects vary by task; dialogue and sentiment are most affected, while NER is surprisingly resilient
  • Absolute vs Approximate: Models struggle differently with absolute ("none") vs approximate ("seldom") negation markers
  • Surface-Level Processing: Reveals models' reliance on pattern matching rather than logical understanding

Discussion

1. Model Brittleness is Task-Specific

FLUKE reveals that model vulnerabilities are highly task-dependent. Tests causing dramatic drops in one task may have minimal impact on others.

  • Negation effects: Severely affects sentiment analysis and dialogue but barely impacts NER
  • Orthographic changes: Significantly hurt NER performance (even for LLMs) but have negligible effects elsewhere
  • Validation: These task-specific patterns support FLUKE's task-agnostic approach, which surfaces meaningful vulnerabilities without preconceived assumptions about where models will fail

2. LLMs Are Not Always More Robust

While LLMs generally outperform PLMs in robustness, this pattern is not universal.

  • LLM weaknesses: LLMs can perform worse than PLMs on negation and dialect in coreference resolution, and show temporal bias in NER
  • Best performers: Claude emerges as the most robust LLM, while T5 shows exceptional robustness among PLMs
  • Universal challenge: Negation remains the most challenging modification for both model types

3. Negation Impact Varies by Task

Negation affects different tasks in distinct ways:

  • Sentiment Analysis: Models incorrectly predict double negation as negative and ignore polarity-switching adverbs
  • Dialogue Understanding: Shows consistent drops as negation changes meaning and affects coherence
  • Coreference Resolution: Less impacted, with errors mainly when verbal negation affects commonsense reasoning
  • NER: Remains largely unaffected, since entity spans stay constant even when the meaning changes

4. LLM-Generated Tests: Effective, with Caveats

Using LLMs (GPT-4o) for test generation proves viable, with some important caveats (a generation sketch follows this list):

  • Success rates: 70%+ success for most modification types
  • Challenging areas: Tests requiring deeper linguistic knowledge (derivation, compounds, grammatical roles) achieve lower rates (50-63%) and need human supervision
  • Cross-model testing: Generated tests successfully reveal vulnerabilities in other models
  • Self-testing caution: GPT-4o showed anomalous NER improvements when testing itself, likely because its generated modifications make entities more "alien" to their contexts
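As an illustration of the generation step, here is a minimal sketch using the OpenAI Python client; the prompt wording, model choice, and absence of a validation step are simplifications, not the paper's actual prompts or pipeline.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_modified_example(sentence: str, modification: str) -> str:
    """Ask an LLM to apply one named modification (e.g. 'verbal negation') to a sentence."""
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0.0,
        messages=[
            {"role": "system",
             "content": "You modify test sentences for robustness evaluation. "
                        "Apply exactly the requested modification and return only "
                        "the modified sentence."},
            {"role": "user",
             "content": f"Modification: {modification}\nSentence: {sentence}"},
        ],
    )
    return response.choices[0].message.content.strip()

# Outputs still need filtering and, for the harder modification types
# (derivation, compounds, grammatical roles), human review.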

Resources

Source Code

Complete implementation of the FLUKE framework, including data generation scripts and evaluation tools.

View on GitHub

Dataset

Modified test data across four linguistic tasks: coreference resolution, NER, sentiment analysis, and dialogue understanding.

Download Dataset

Paper

Full research paper with detailed methodology, experiments, and analysis of the FLUKE framework.

Read Paper

Citation

@article{otmakhova2025fluke,
  title={FLUKE: A Linguistically-Driven and Task-Agnostic Framework for Robustness Evaluation},
  author={Yulia Otmakhova and Hung Thinh Truong and Rahmad Mahendra and Zenan Zhai and Rongxin Zhu and Daniel Beck and Jey Han Lau},
  year={2025},
  eprint={2504.17311},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2504.17311}
}