¹The University of Melbourne · ²RMIT University · ³Oracle
FLUKE introduces controlled variations across linguistic levels—from orthography to dialect—to systematically evaluate model robustness through minimal modifications of test data.
We present FLUKE, a task-agnostic framework for assessing model robustness through systematic minimal variations of test data. FLUKE introduces controlled variations across linguistic levels—from orthography to dialect and style—and leverages large language models with human validation to generate modifications.
We evaluate both fine-tuned models and LLMs across six diverse NLP tasks, revealing that: (1) the impact of linguistic variations is highly task-dependent; (2) LLMs still exhibit significant brittleness, with reasoning LLMs showing less robustness on some tasks; (3) natural modifications hurt more than corruption-style tests; (4) generation ability does not correlate with robustness.
FLUKE applies controlled linguistic modifications that preserve task labels while testing model robustness.
Apply FLUKE modifications using your OpenAI API key. Your key is used directly in-browser and never stored.
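FLUKE itself generates modifications with LLMs plus human validation, but the corruption-style tests are easy to script locally. Below is a minimal sketch of two orthography modifications (adjacent-letter swap and capitalization change); the function names are illustrative, not from the paper.

```python
import random

def swap_adjacent_chars(text: str, seed: int = 0) -> str:
    """Corruption-style spelling test: swap two adjacent interior
    letters of one randomly chosen word longer than three characters."""
    rng = random.Random(seed)
    words = text.split()
    candidates = [i for i, w in enumerate(words) if len(w) > 3 and w.isalpha()]
    if not candidates:
        return text
    wi = rng.choice(candidates)
    w = words[wi]
    ci = rng.randrange(1, len(w) - 2)  # keep first and last letters intact
    chars = list(w)
    chars[ci], chars[ci + 1] = chars[ci + 1], chars[ci]
    words[wi] = "".join(chars)
    return " ".join(words)

def uppercase_entity(text: str, entity: str) -> str:
    """Capitalization test: upper-case a single target span."""
    return text.replace(entity, entity.upper(), 1)
```

The natural modifications (voice, style, dialect) that FLUKE finds most damaging cannot be scripted this way; that is where the LLM-with-human-validation pipeline comes in.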
Some tests are critical for certain tasks but irrelevant for others. Negation devastates Dialogue (30%) but barely affects NER (7.4%).
Reasoning LLMs (GPT-5, DeepSeek R1) sometimes show less robustness than base models on classification tasks.
Fluent modifications (syntax, style, dialect) cause more brittleness than corruption tests like letter flipping.
A model's ability to use a linguistic feature in generation doesn't predict its robustness to that feature.
We evaluate across classification and generation tasks using PLMs, base LLMs, and reasoning LLMs.
KnowRef dataset (coreference resolution)
DECODE dataset (dialogue contradiction detection)
Few-NERD dataset (named entity recognition)
SST-2 dataset (sentiment analysis)
GSM8K dataset (math reasoning)
IFEval benchmark (instruction following)
Unrobustness (U%) measures prediction changes between original and modified instances. Higher = more brittle. Select a task to view detailed results.
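Given aligned predictions on original and modified instances, U% is simply the fraction that change. A minimal sketch (the function name is mine, not the paper's):

```python
def unrobustness(original_preds, modified_preds):
    """U%: share of instances whose prediction changes after the
    linguistic modification is applied. Higher = more brittle."""
    if len(original_preds) != len(modified_preds):
        raise ValueError("prediction lists must be aligned")
    changed = sum(o != m for o, m in zip(original_preds, modified_preds))
    return 100.0 * changed / len(original_preds)

# Example: 2 of 5 predictions flip under a modification -> U% = 40.0
u = unrobustness(["pos", "neg", "pos", "pos", "neg"],
                 ["pos", "pos", "pos", "neg", "neg"])
```

Note that U% counts any change, so on classification tasks it captures flips in both directions, not only accuracy drops.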
| Category | Modification | BERT | GPT-2 | T5 | GPT-4o | Claude | Llama | GPT-5 | DS R1 | Avg |
|---|---|---|---|---|---|---|---|---|---|---|
| Bias | Temporal | 9.0 | 4.0 | 6.0 | 8.0 | 1.0 | 7.0 | 3.0 | 8.0 | 5.8 |
| | Geographical | 8.0 | 14.0 | 9.0 | 10.0 | 4.0 | 5.0 | 6.0 | 10.0 | 8.2 |
| | Length | 19.2 | 15.2 | 14.1 | 15.2 | 20.2 | 17.2 | 8.1 | 13.1 | 15.3 |
| Orthography | Spelling | 11.2 | 3.1 | 7.1 | 5.1 | 2.0 | 4.1 | 7.1 | 4.1 | 5.5 |
| | Capitalization | 15.2 | 14.1 | 7.1 | 7.1 | 4.0 | 5.1 | 2.0 | 6.1 | 7.6 |
| | Punctuation | 1.0 | 7.1 | 1.0 | 1.0 | 3.0 | 1.0 | 2.0 | 4.0 | 2.5 |
| Morphology | Derivation | 4.1 | 3.1 | 5.1 | 4.1 | 1.0 | 1.0 | 5.1 | 2.0 | 3.2 |
| | Compound | 6.2 | 4.2 | 3.1 | 5.2 | 6.2 | 6.2 | 3.1 | 4.2 | 4.8 |
| Syntax | Voice | 35.8 | 34.7 | 41.1 | 26.3 | 15.8 | 30.5 | 9.5 | 15.8 | 26.2 |
| | Grammar | 30.6 | 27.8 | 19.4 | 22.2 | 18.1 | 19.4 | 16.7 | 20.8 | 21.9 |
| | Conjunction | 5.2 | 10.3 | 8.2 | 8.2 | 4.1 | 7.2 | 5.2 | 6.2 | 6.8 |
| Semantics | Concept | 8.0 | 3.0 | 5.0 | 11.0 | 18.0 | 9.0 | 10.0 | 10.0 | 9.2 |
| | Negation | 25.5 | 23.5 | 24.5 | 21.4 | 24.5 | 22.4 | 22.4 | 30.6 | 24.4 |
| Discourse | Markers | 8.0 | 6.0 | 8.0 | 19.0 | 10.0 | 7.0 | 9.0 | 11.0 | 9.8 |
| | Appraisal | 5.0 | 8.0 | 4.0 | 8.0 | 7.0 | 9.0 | 5.0 | 7.0 | 6.6 |
| Varieties | Style | 16.0 | 19.0 | 18.0 | 19.0 | 14.0 | 14.0 | 14.0 | 12.0 | 15.8 |
| | Dialect | 8.3 | 26.9 | 22.2 | 24.5 | 11.8 | 16.7 | 2.9 | 12.7 | 15.8 |
| | Average | 12.7 | 13.2 | 11.9 | 12.7 | 9.7 | 10.7 | 7.7 | 10.5 | 11.1 |
Voice and Grammar modifications cause the highest unrobustness (26.2% and 21.9% avg). GPT-5 shows the best overall robustness (7.7% avg).
| Category | Modification | BERT | GPT-2 | T5 | GPT-4o | Claude | Llama | GPT-5 | DS R1 | Avg |
|---|---|---|---|---|---|---|---|---|---|---|
| Bias | Temporal | 8.0 | 6.0 | 7.0 | 2.0 | 5.0 | 1.0 | 3.0 | 2.0 | 4.2 |
| | Geographical | 6.5 | 8.7 | 12.0 | 13.0 | 15.2 | 12.0 | 18.5 | 13.0 | 12.4 |
| | Length | 10.0 | 6.0 | 5.0 | 4.0 | 6.0 | 1.0 | 6.0 | 7.0 | 5.6 |
| Orthography | Spelling | 6.0 | 6.0 | 4.0 | 8.0 | 3.0 | 2.0 | 1.0 | 6.0 | 4.5 |
| | Capitalization | 0.0 | 6.2 | 6.2 | 3.1 | 2.1 | 2.1 | 4.2 | 1.0 | 3.1 |
| | Punctuation | 4.0 | 2.0 | 8.0 | 3.0 | 3.0 | 1.0 | 5.0 | 2.0 | 3.5 |
| Morphology | Derivation | 0.0 | 2.2 | 1.1 | 4.3 | 5.4 | 1.1 | 5.4 | 6.5 | 3.2 |
| | Compound | 6.0 | 4.0 | 3.0 | 4.0 | 7.0 | 1.0 | 8.0 | 6.0 | 4.9 |
| Syntax | Voice | 9.0 | 11.0 | 7.0 | 7.0 | 2.0 | 3.0 | 9.0 | 5.0 | 6.6 |
| | Grammar | 14.3 | 12.9 | 7.1 | 20.6 | 7.4 | 11.8 | 16.2 | 19.1 | 13.7 |
| | Conjunction | 5.0 | 4.0 | 7.0 | 6.0 | 1.0 | 2.0 | 4.0 | 4.0 | 4.1 |
| Semantics | Concept | 7.0 | 5.0 | 4.0 | 8.0 | 7.0 | 2.0 | 10.0 | 6.0 | 6.1 |
| | Negation | 29.0 | 39.0 | 31.0 | 31.0 | 30.0 | 30.0 | 25.0 | 25.0 | 30.0 |
| Discourse | Markers | 5.7 | 9.2 | 3.4 | 3.4 | 2.3 | 3.4 | 5.7 | 5.7 | 4.9 |
| | Appraisal | 6.0 | 10.0 | 5.0 | 7.0 | 6.0 | 4.0 | 10.0 | 11.0 | 7.4 |
| Varieties | Style | 12.0 | 10.0 | 12.0 | 6.0 | 4.0 | 2.0 | 6.0 | 6.0 | 7.2 |
| | Dialect | 20.8 | 5.7 | 9.4 | 11.3 | 15.1 | 7.5 | 5.7 | 13.2 | 11.1 |
| | Average | 8.8 | 8.7 | 7.8 | 8.3 | 7.1 | 5.1 | 8.4 | 8.2 | 7.8 |
Negation causes severe brittleness (30.0% avg). Llama 3.1 shows the best robustness on dialogue (5.1% avg).
| Category | Modification | BERT | GPT-2 | T5 | GPT-4o | Claude | Llama | GPT-5 | DS R1 | Avg |
|---|---|---|---|---|---|---|---|---|---|---|
| Bias | Temporal | 3.7 | 3.0 | 1.2 | 6.2 | 4.3 | 1.8 | 7.7 | 10.6 | 4.8 |
| | Geographical | 25.1 | 27.6 | 29.0 | 22.3 | 26.0 | 27.0 | 22.0 | 27.8 | 25.8 |
| | Length | 11.8 | 12.2 | 13.1 | 12.7 | 15.7 | 6.4 | 9.7 | 12.6 | 11.8 |
| Orthography | Spelling | 4.3 | 2.5 | 1.4 | 6.6 | 3.0 | 3.9 | 7.3 | 11.4 | 5.0 |
| | Capitalization | 0.9 | 19.3 | 13.1 | 11.3 | 8.9 | 15.3 | 9.1 | 13.6 | 11.5 |
| | Punctuation | 6.5 | 3.7 | 4.9 | 7.1 | 9.1 | 7.9 | 7.8 | 15.0 | 7.8 |
| Morphology | Derivation | 1.9 | 4.2 | 5.8 | 3.7 | 2.0 | 1.3 | 9.5 | 8.2 | 4.6 |
| | Compound | 3.1 | 0.6 | 5.2 | 3.1 | 1.7 | 1.0 | 7.0 | 12.9 | 4.3 |
| Syntax | Voice | 7.8 | 10.8 | 5.7 | 7.5 | 5.5 | 4.5 | 8.3 | 11.3 | 7.7 |
| | Grammar | 8.0 | 15.9 | 10.5 | 8.3 | 3.5 | 5.3 | 10.8 | 12.4 | 9.3 |
| | Conjunction | 9.1 | 7.6 | 7.7 | 9.4 | 7.5 | 8.0 | 11.7 | 13.2 | 9.3 |
| Semantics | Concept | 5.0 | 8.9 | 5.3 | 6.5 | 4.8 | 3.9 | 8.5 | 9.4 | 6.5 |
| | Negation | 4.8 | 5.2 | 6.5 | 8.2 | 3.9 | 4.6 | 10.5 | 15.2 | 7.4 |
| Discourse | Markers | 4.7 | 1.6 | 5.4 | 2.2 | 1.9 | 3.2 | 6.3 | 8.7 | 4.2 |
| | Appraisal | 5.5 | 2.6 | 4.8 | 3.2 | 3.1 | 2.9 | 7.1 | 14.0 | 5.4 |
| Varieties | Style | 11.8 | 12.2 | 13.1 | 3.6 | 4.8 | 6.4 | 7.4 | 10.9 | 8.8 |
| | Dialect | 12.6 | 7.4 | 8.4 | 5.9 | 8.9 | 4.7 | 7.0 | 13.2 | 8.5 |
| | Average | 7.4 | 8.6 | 8.3 | 7.5 | 6.7 | 6.4 | 9.3 | 13.0 | 8.4 |
Geographical bias causes the highest NER brittleness (25.8% avg). DeepSeek R1 shows the highest unrobustness (13.0% avg).
| Category | Modification | BERT | GPT-2 | T5 | GPT-4o | Claude | Llama | GPT-5 | DS R1 | Avg |
|---|---|---|---|---|---|---|---|---|---|---|
| Bias | Temporal | 5.0 | 5.0 | 3.0 | 0.0 | 8.0 | 1.0 | 3.0 | 3.0 | 3.5 |
| | Geographical | 5.0 | 4.0 | 3.0 | 9.0 | 8.0 | 7.0 | 4.0 | 8.0 | 6.0 |
| | Length | 3.0 | 2.0 | 3.0 | 3.0 | 9.0 | 2.0 | 1.0 | 1.0 | 3.0 |
| Orthography | Spelling | 4.0 | 4.0 | 0.0 | 0.0 | 6.0 | 3.0 | 0.0 | 0.0 | 2.1 |
| | Capitalization | 0.0 | 3.0 | 3.0 | 1.0 | 7.1 | 3.0 | 0.0 | 3.0 | 2.5 |
| | Punctuation | 1.0 | 2.0 | 0.0 | 1.0 | 3.0 | 3.0 | 1.0 | 1.0 | 1.5 |
| Morphology | Derivation | 3.4 | 5.7 | 2.3 | 2.3 | 6.9 | 2.3 | 3.4 | 4.6 | 3.9 |
| | Compound | 3.2 | 8.4 | 3.2 | 2.1 | 5.3 | 1.1 | 3.2 | 4.2 | 3.8 |
| Syntax | Voice | 9.0 | 10.0 | 9.0 | 4.0 | 10.0 | 3.0 | 5.0 | 4.0 | 6.8 |
| | Grammar | 4.5 | 12.1 | 3.0 | 4.5 | 7.6 | 4.5 | 4.5 | 4.5 | 5.7 |
| | Conjunction | 3.0 | 6.0 | 1.0 | 1.0 | 3.0 | 3.0 | 0.0 | 0.0 | 2.1 |
| Semantics | Concept | 6.0 | 6.0 | 5.0 | 4.0 | 4.0 | 2.0 | 1.0 | 5.0 | 4.1 |
| | Negation | 22.9 | 20.8 | 25.0 | 16.7 | 17.7 | 15.6 | 16.7 | 17.7 | 19.1 |
| Discourse | Markers | 3.0 | 6.1 | 4.0 | 1.0 | 12.1 | 3.0 | 2.0 | 3.0 | 4.3 |
| | Sentiment | 19.0 | 16.0 | 12.0 | 14.0 | 15.0 | 16.0 | 11.0 | 10.0 | 14.1 |
| Varieties | Casual | 9.0 | 9.0 | 5.0 | 3.0 | 7.0 | 6.0 | 3.0 | 2.0 | 5.5 |
| | Dialect | 9.8 | 9.8 | 7.8 | 4.9 | 6.9 | 5.9 | 2.9 | 3.9 | 6.5 |
| | Average | 6.5 | 7.6 | 5.3 | 4.2 | 8.0 | 4.8 | 3.6 | 4.4 | 5.6 |
Negation (19.1%) and sentiment shifts (14.1%) cause the most brittleness. GPT-5 shows the best robustness (3.6% avg).
| Category | Modification | GPT-4o | Claude-3.5 | Llama 3.1 | GPT-5 | DS R1 | Avg |
|---|---|---|---|---|---|---|---|
| Bias | Temporal | 1.0 | 2.0 | 4.0 | 0.0 | 1.0 | 1.6 |
| | Geographical | 5.0 | 5.0 | 7.0 | 1.0 | 2.0 | 4.0 |
| | Length | 4.0 | 2.0 | 2.0 | 0.0 | 2.0 | 2.0 |
| Orthography | Spelling | 1.0 | 0.0 | 2.0 | 1.0 | 0.0 | 0.8 |
| | Capitalization | 3.0 | 1.0 | 5.0 | 1.0 | 1.0 | 2.2 |
| | Punctuation | 0.0 | 0.0 | 5.0 | 1.0 | 1.0 | 1.4 |
| Semantics | Concept | 1.0 | 0.0 | 5.0 | 2.0 | 1.0 | 1.8 |
| | Negation | 15.0 | 15.0 | 18.0 | 7.0 | 9.0 | 12.8 |
| Discourse | Appraisal | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 0.6 |
| Varieties | Style | 4.0 | 3.0 | 6.0 | 4.0 | 3.0 | 4.0 |
| | Dialect | 2.0 | 4.0 | 6.0 | 1.0 | 2.0 | 3.0 |
| Syntax | Conjunction | 0.0 | 1.0 | 4.0 | 1.0 | 1.0 | 1.4 |
| | Voice | 2.0 | 2.0 | 6.0 | 1.0 | 2.0 | 2.6 |
| | Average | 2.9 | 2.7 | 5.5 | 1.6 | 2.0 | 2.9 |
Math reasoning is the most robust task overall (2.9% avg). Negation remains the key challenge (12.8%). GPT-5 performs best (1.6% avg).
| Category | Modification | GPT-4o | Claude-3.5 | Llama 3.1 | GPT-5 | DS R1 | Avg |
|---|---|---|---|---|---|---|---|
| Bias | Temporal | 11.1 | 4.0 | 12.1 | 5.1 | 12.1 | 8.9 |
| | Geographical | 9.0 | 12.0 | 14.0 | 9.0 | 10.0 | 10.8 |
| | Length | 11.0 | 10.0 | 18.0 | 6.0 | 16.0 | 12.2 |
| Orthography | Capitalization | 8.1 | 5.1 | 10.1 | 9.1 | 13.1 | 9.1 |
| | Punctuation | 7.1 | 3.0 | 13.1 | 8.1 | 12.1 | 8.7 |
| | Spelling | 3.1 | 4.1 | 7.2 | 6.2 | 8.2 | 5.8 |
| Syntax | Conjunction | 11.0 | 7.0 | 6.0 | 8.0 | 11.0 | 8.6 |
| | Voice | 9.0 | 9.0 | 10.0 | 6.0 | 15.0 | 9.8 |
| Semantics | Concept | 9.1 | 7.1 | 11.1 | 7.1 | 7.1 | 8.3 |
| | Negation | 24.0 | 22.0 | 24.0 | 23.0 | 23.0 | 23.2 |
| Discourse | Appraisal | 7.1 | 4.1 | 11.2 | 7.1 | 6.1 | 7.1 |
| Varieties | Style | 18.0 | 7.0 | 10.0 | 9.0 | 11.0 | 11.0 |
| | Dialect | 13.0 | 13.0 | 12.0 | 11.0 | 15.0 | 12.8 |
| | Average | 10.8 | 8.3 | 12.2 | 8.8 | 12.3 | 10.5 |
IFEval shows the highest overall unrobustness (10.5% avg). Negation (23.2%) and Dialect (12.8%) cause the most failures.
How does model size affect robustness? We compare Llama 3.1 at 8B, 70B, and 405B parameters on the GSM math task.
| Category | Modification | Llama 8B | Llama 70B | Llama 405B | Avg |
|---|---|---|---|---|---|
| Bias | Temporal | 8.0 | 5.0 | 4.0 | 5.7 |
| | Geographical | 8.0 | 3.0 | 7.0 | 6.0 |
| | Length | 8.0 | 2.0 | 2.0 | 4.0 |
| Orthography | Spelling | 10.0 | 2.0 | 2.0 | 4.7 |
| | Capitalization | 11.0 | 2.0 | 5.0 | 6.0 |
| | Punctuation | 5.0 | 1.0 | 5.0 | 3.7 |
| Semantics | Concept | 12.0 | 1.0 | 5.0 | 6.0 |
| | Negation | 22.0 | 17.0 | 18.0 | 19.0 |
| Discourse | Appraisal | 7.0 | 5.0 | 1.0 | 4.3 |
| Varieties | Style | 16.0 | 6.0 | 6.0 | 9.3 |
| | Dialect | 10.0 | 6.0 | 6.0 | 7.3 |
| Syntax | Conjunction | 12.0 | 2.0 | 4.0 | 6.0 |
| | Voice | 11.0 | 2.0 | 6.0 | 6.3 |
| | Average | 10.8 | 4.2 | 5.5 | 6.8 |
Key finding: scaling from 8B to 70B dramatically improves robustness (10.8% → 4.2%), but scaling from 70B to 405B brings no further gain (4.2% → 5.5%). Negation remains challenging at all scales (17-22%). Larger models achieve higher accuracy, but this does not always translate into better robustness.
Color legend: High ≥ 15% · Medium 10-15% · Low ≤ 2%
@article{otmakhova2025fluke,
title = {FLUKE: A Linguistically-Driven and Task-Agnostic
Framework for Robustness Evaluation},
author = {Otmakhova, Yulia and Truong, Hung Thinh and
Mahendra, Rahmad and Zhai, Zenan and Zhu, Rongxin
and Beck, Daniel and Lau, Jey Han},
journal = {arXiv preprint arXiv:2504.17311},
year = {2025}
}