Preprint

FLUKE: A Linguistically-Driven and Task-Agnostic Framework for Robustness Evaluation

Yulia Otmakhova1*, Hung Thinh Truong1*, Rahmad Mahendra2, Zenan Zhai3, Rongxin Zhu1,3, Daniel Beck2, Jey Han Lau1

1The University of Melbourne   2RMIT University   3Oracle

FLUKE introduces controlled variations across linguistic levels—from orthography to dialect—to systematically evaluate model robustness through minimal modifications of test data.

Abstract

We present FLUKE, a task-agnostic framework for assessing model robustness through systematic minimal variations of test data. FLUKE introduces controlled variations across linguistic levels—from orthography to dialect and style—and leverages large language models with human validation to generate modifications.

We evaluate both fine-tuned models and LLMs across six diverse NLP tasks, revealing that: (1) the impact of linguistic variations is highly task-dependent; (2) LLMs still exhibit significant brittleness, with reasoning LLMs showing less robustness than base models on some tasks; (3) natural, fluent modifications hurt more than corruption-style tests; (4) a model's generation ability doesn't correlate with its robustness.

Examples

Modification Types

FLUKE applies controlled linguistic modifications that preserve task labels while testing model robustness.

- Capitalization: The B-52 pilot, Major Messinger → The B-52 PILOT, Major Messinger
- Punctuation: Fort Wayne Line trains → FortWayne Line trains
- Spelling (Typo): Grosvenor Square in the 1960s → Grosvenor Squar in the 1960s
- Derivation: a software company created in 2009 → a software company born in 2009
- Compound Words: a gated community called Orchid Island → a high-security gated community called Orchid Island
- Active→Passive: Back to Basics focuses on genre classics → Genre classics are focused on by Back to Basics
- Grammatical Role: ran for the House from New York → ran for New York from the House
- Coord. Conjunction: conflicts were making headlines → conflicts were escalating and making headlines
- Concept Replacement: wallowing in generic angst → wallowing in generic distress
- Negation: developed by Golaem → developed by no company, neither Golaem nor any other
- Discourse Markers: Moreover, Santa is actually innocent → Santa is actually innocent
- Sentiment/Appraisal: Loyalists recruited from Queens County → Loyalists reluctantly recruited from Queens County
- Dialectal: that sounds delicious to me! → that sounds shiok lah!
- Casual Style: a 1945 British comedy-drama film → a '45 British comedy-drama flick
- Temporal Bias: a gold-digging villainness → a gold-digging villainess
- Geographical Bias: The American fishing community → The Nauruan fishing community
- Length Bias: He went on to produce Kim Fowley and the BMX Bandits → He produced Kim Fowley and BMX Bandits
Try It

Flukify Your Text

Apply FLUKE modifications using your OpenAI API key. Your key is used directly in-browser and never stored.

Note: API calls are made directly from your browser. Your API key is never stored or sent anywhere except OpenAI. For programmatic use, see the Flukify CLI tool.
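For a sense of what such an API call involves, here is a hedged sketch of a prompt one might send to an LLM to request a single modification. The wording and function name are our own assumptions, not the actual Flukify prompt:

```python
# Illustrative only: builds a prompt asking an LLM for one FLUKE-style
# modification. Not the actual Flukify implementation.
def build_modification_prompt(text: str, mod_type: str) -> str:
    return (
        f"Apply a single '{mod_type}' modification to the text below, "
        "changing as little as possible and preserving the task label. "
        "Return only the modified text.\n\n"
        f"Text: {text}"
    )

prompt = build_modification_prompt("developed by Golaem", "negation")
print(prompt)
```

The resulting string would be sent as the user message of a chat-completion request; FLUKE additionally applies human validation to the generated modifications.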
Key Findings
1. Task-Dependent Impact: Some tests are critical for certain tasks but irrelevant for others. Negation devastates Dialogue (30%) but barely affects NER (7.4%).

2. Reasoning ≠ Robustness: Reasoning LLMs (GPT-5, DeepSeek R1) sometimes show less robustness than base models on classification tasks.

3. Natural > Adversarial: Fluent modifications (syntax, style, dialect) cause more brittleness than corruption tests like letter flipping.

4. Generation ≠ Understanding: A model's ability to use a linguistic feature in generation doesn't predict its robustness to that feature.

Evaluation Setup

Six Tasks, Eight Models, 17 Tests

We evaluate across classification and generation tasks using PLMs, base LLMs, and reasoning LLMs.

Classification tasks:
- Coreference (KnowRef dataset)
- Dialogue (DECODE dataset)
- NER (Few-NERD dataset)
- Sentiment (SST-2 dataset)

Generation tasks:
- Math (GSM8K dataset)
- Instruction following (IFEval benchmark)

Models: BERT, GPT-2, T5, GPT-4o, Claude 3.5, Llama 3.1, GPT-5, DeepSeek R1
Results

Unrobustness by Task

Unrobustness (U%) measures the percentage of predictions that change between the original and modified instances of a task; higher values indicate a more brittle model.
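A minimal sketch of this metric, computed over paired predictions (the core idea only; the paper may apply additional filtering):

```python
# Unrobustness (U%): the percentage of paired predictions that flip
# after the instance is modified.
def unrobustness(original_preds, modified_preds):
    assert len(original_preds) == len(modified_preds)
    flips = sum(o != m for o, m in zip(original_preds, modified_preds))
    return 100.0 * flips / len(original_preds)

# Example: 1 of 4 predictions changes after modification -> U = 25.0
print(unrobustness(["POS", "NEG", "POS", "NEG"],
                   ["POS", "NEG", "POS", "POS"]))  # 25.0
```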

Coreference Resolution Classification

| Category | Modification | BERT | GPT-2 | T5 | GPT-4o | Claude | Llama | GPT-5 | DS R1 | Avg |
|---|---|---|---|---|---|---|---|---|---|---|
| Bias | Temporal | 9.0 | 4.0 | 6.0 | 8.0 | 1.0 | 7.0 | 3.0 | 8.0 | 5.8 |
| Bias | Geographical | 8.0 | 14.0 | 9.0 | 10.0 | 4.0 | 5.0 | 6.0 | 10.0 | 8.2 |
| Bias | Length | 19.2 | 15.2 | 14.1 | 15.2 | 20.2 | 17.2 | 8.1 | 13.1 | 15.3 |
| Orthography | Spelling | 11.2 | 3.1 | 7.1 | 5.1 | 2.0 | 4.1 | 7.1 | 4.1 | 5.5 |
| Orthography | Capitalization | 15.2 | 14.1 | 7.1 | 7.1 | 4.0 | 5.1 | 2.0 | 6.1 | 7.6 |
| Orthography | Punctuation | 1.0 | 7.1 | 1.0 | 1.0 | 3.0 | 1.0 | 2.0 | 4.0 | 2.5 |
| Morphology | Derivation | 4.1 | 3.1 | 5.1 | 4.1 | 1.0 | 1.0 | 5.1 | 2.0 | 3.2 |
| Morphology | Compound | 6.2 | 4.2 | 3.1 | 5.2 | 6.2 | 6.2 | 3.1 | 4.2 | 4.8 |
| Syntax | Voice | 35.8 | 34.7 | 41.1 | 26.3 | 15.8 | 30.5 | 9.5 | 15.8 | 26.2 |
| Syntax | Grammar | 30.6 | 27.8 | 19.4 | 22.2 | 18.1 | 19.4 | 16.7 | 20.8 | 21.9 |
| Syntax | Conjunction | 5.2 | 10.3 | 8.2 | 8.2 | 4.1 | 7.2 | 5.2 | 6.2 | 6.8 |
| Semantics | Concept | 8.0 | 3.0 | 5.0 | 11.0 | 18.0 | 9.0 | 10.0 | 10.0 | 9.2 |
| Semantics | Negation | 25.5 | 23.5 | 24.5 | 21.4 | 24.5 | 22.4 | 22.4 | 30.6 | 24.4 |
| Discourse | Markers | 8.0 | 6.0 | 8.0 | 19.0 | 10.0 | 7.0 | 9.0 | 11.0 | 9.8 |
| Discourse | Appraisal | 5.0 | 8.0 | 4.0 | 8.0 | 7.0 | 9.0 | 5.0 | 7.0 | 6.6 |
| Varieties | Style | 16.0 | 19.0 | 18.0 | 19.0 | 14.0 | 14.0 | 14.0 | 12.0 | 15.8 |
| Varieties | Dialect | 8.3 | 26.9 | 22.2 | 24.5 | 11.8 | 16.7 | 2.9 | 12.7 | 15.8 |
| | Average | 12.7 | 13.2 | 11.9 | 12.7 | 9.7 | 10.7 | 7.7 | 10.5 | 11.1 |

Voice and Grammar modifications cause highest unrobustness (26%, 22%). GPT-5 shows best overall robustness (7.7% avg).

Dialogue Understanding Classification

| Category | Modification | BERT | GPT-2 | T5 | GPT-4o | Claude | Llama | GPT-5 | DS R1 | Avg |
|---|---|---|---|---|---|---|---|---|---|---|
| Bias | Temporal | 8.0 | 6.0 | 7.0 | 2.0 | 5.0 | 1.0 | 3.0 | 2.0 | 4.2 |
| Bias | Geographical | 6.5 | 8.7 | 12.0 | 13.0 | 15.2 | 12.0 | 18.5 | 13.0 | 12.4 |
| Bias | Length | 10.0 | 6.0 | 5.0 | 4.0 | 6.0 | 1.0 | 6.0 | 7.0 | 5.6 |
| Orthography | Spelling | 6.0 | 6.0 | 4.0 | 8.0 | 3.0 | 2.0 | 1.0 | 6.0 | 4.5 |
| Orthography | Capitalization | 0.0 | 6.2 | 6.2 | 3.1 | 2.1 | 2.1 | 4.2 | 1.0 | 3.1 |
| Orthography | Punctuation | 4.0 | 2.0 | 8.0 | 3.0 | 3.0 | 1.0 | 5.0 | 2.0 | 3.5 |
| Morphology | Derivation | 0.0 | 2.2 | 1.1 | 4.3 | 5.4 | 1.1 | 5.4 | 6.5 | 3.2 |
| Morphology | Compound | 6.0 | 4.0 | 3.0 | 4.0 | 7.0 | 1.0 | 8.0 | 6.0 | 4.9 |
| Syntax | Voice | 9.0 | 11.0 | 7.0 | 7.0 | 2.0 | 3.0 | 9.0 | 5.0 | 6.6 |
| Syntax | Grammar | 14.3 | 12.9 | 7.1 | 20.6 | 7.4 | 11.8 | 16.2 | 19.1 | 13.7 |
| Syntax | Conjunction | 5.0 | 4.0 | 7.0 | 6.0 | 1.0 | 2.0 | 4.0 | 4.0 | 4.1 |
| Semantics | Concept | 7.0 | 5.0 | 4.0 | 8.0 | 7.0 | 2.0 | 10.0 | 6.0 | 6.1 |
| Semantics | Negation | 29.0 | 39.0 | 31.0 | 31.0 | 30.0 | 30.0 | 25.0 | 25.0 | 30.0 |
| Discourse | Markers | 5.7 | 9.2 | 3.4 | 3.4 | 2.3 | 3.4 | 5.7 | 5.7 | 4.9 |
| Discourse | Appraisal | 6.0 | 10.0 | 5.0 | 7.0 | 6.0 | 4.0 | 10.0 | 11.0 | 7.4 |
| Varieties | Style | 12.0 | 10.0 | 12.0 | 6.0 | 4.0 | 2.0 | 6.0 | 6.0 | 7.2 |
| Varieties | Dialect | 20.8 | 5.7 | 9.4 | 11.3 | 15.1 | 7.5 | 5.7 | 13.2 | 11.1 |
| | Average | 8.8 | 8.7 | 7.8 | 8.3 | 7.1 | 5.1 | 8.4 | 8.2 | 7.8 |

Negation causes severe brittleness (30% avg). Llama 3.1 shows best robustness (5.1% avg) on dialogue.

Named Entity Recognition Classification

| Category | Modification | BERT | GPT-2 | T5 | GPT-4o | Claude | Llama | GPT-5 | DS R1 | Avg |
|---|---|---|---|---|---|---|---|---|---|---|
| Bias | Temporal | 3.7 | 3.0 | 1.2 | 6.2 | 4.3 | 1.8 | 7.7 | 10.6 | 4.8 |
| Bias | Geographical | 25.1 | 27.6 | 29.0 | 22.3 | 26.0 | 27.0 | 22.0 | 27.8 | 25.8 |
| Bias | Length | 11.8 | 12.2 | 13.1 | 12.7 | 15.7 | 6.4 | 9.7 | 12.6 | 11.8 |
| Orthography | Spelling | 4.3 | 2.5 | 1.4 | 6.6 | 3.0 | 3.9 | 7.3 | 11.4 | 5.0 |
| Orthography | Capitalization | 0.9 | 19.3 | 13.1 | 11.3 | 8.9 | 15.3 | 9.1 | 13.6 | 11.5 |
| Orthography | Punctuation | 6.5 | 3.7 | 4.9 | 7.1 | 9.1 | 7.9 | 7.8 | 15.0 | 7.8 |
| Morphology | Derivation | 1.9 | 4.2 | 5.8 | 3.7 | 2.0 | 1.3 | 9.5 | 8.2 | 4.6 |
| Morphology | Compound | 3.1 | 0.6 | 5.2 | 3.1 | 1.7 | 1.0 | 7.0 | 12.9 | 4.3 |
| Syntax | Voice | 7.8 | 10.8 | 5.7 | 7.5 | 5.5 | 4.5 | 8.3 | 11.3 | 7.7 |
| Syntax | Grammar | 8.0 | 15.9 | 10.5 | 8.3 | 3.5 | 5.3 | 10.8 | 12.4 | 9.3 |
| Syntax | Conjunction | 9.1 | 7.6 | 7.7 | 9.4 | 7.5 | 8.0 | 11.7 | 13.2 | 9.3 |
| Semantics | Concept | 5.0 | 8.9 | 5.3 | 6.5 | 4.8 | 3.9 | 8.5 | 9.4 | 6.5 |
| Semantics | Negation | 4.8 | 5.2 | 6.5 | 8.2 | 3.9 | 4.6 | 10.5 | 15.2 | 7.4 |
| Discourse | Markers | 4.7 | 1.6 | 5.4 | 2.2 | 1.9 | 3.2 | 6.3 | 8.7 | 4.2 |
| Discourse | Appraisal | 5.5 | 2.6 | 4.8 | 3.2 | 3.1 | 2.9 | 7.1 | 14.0 | 5.4 |
| Varieties | Style | 11.8 | 12.2 | 13.1 | 3.6 | 4.8 | 6.4 | 7.4 | 10.9 | 8.8 |
| Varieties | Dialect | 12.6 | 7.4 | 8.4 | 5.9 | 8.9 | 4.7 | 7.0 | 13.2 | 8.5 |
| | Average | 7.4 | 8.6 | 8.3 | 7.5 | 6.7 | 6.4 | 9.3 | 13.0 | 8.4 |

Geographical bias causes highest NER brittleness (25.8%). DeepSeek R1 shows highest unrobustness (13% avg).

Sentiment Analysis Classification

| Category | Modification | BERT | GPT-2 | T5 | GPT-4o | Claude | Llama | GPT-5 | DS R1 | Avg |
|---|---|---|---|---|---|---|---|---|---|---|
| Bias | Temporal | 5.0 | 5.0 | 3.0 | 0.0 | 8.0 | 1.0 | 3.0 | 3.0 | 3.5 |
| Bias | Geographical | 5.0 | 4.0 | 3.0 | 9.0 | 8.0 | 7.0 | 4.0 | 8.0 | 6.0 |
| Bias | Length | 3.0 | 2.0 | 3.0 | 3.0 | 9.0 | 2.0 | 1.0 | 1.0 | 3.0 |
| Orthography | Spelling | 4.0 | 4.0 | 0.0 | 0.0 | 6.0 | 3.0 | 0.0 | 0.0 | 2.1 |
| Orthography | Capitalization | 0.0 | 3.0 | 3.0 | 1.0 | 7.1 | 3.0 | 0.0 | 3.0 | 2.5 |
| Orthography | Punctuation | 1.0 | 2.0 | 0.0 | 1.0 | 3.0 | 3.0 | 1.0 | 1.0 | 1.5 |
| Morphology | Derivation | 3.4 | 5.7 | 2.3 | 2.3 | 6.9 | 2.3 | 3.4 | 4.6 | 3.9 |
| Morphology | Compound | 3.2 | 8.4 | 3.2 | 2.1 | 5.3 | 1.1 | 3.2 | 4.2 | 3.8 |
| Syntax | Voice | 9.0 | 10.0 | 9.0 | 4.0 | 10.0 | 3.0 | 5.0 | 4.0 | 6.8 |
| Syntax | Grammar | 4.5 | 12.1 | 3.0 | 4.5 | 7.6 | 4.5 | 4.5 | 4.5 | 5.7 |
| Syntax | Conjunction | 3.0 | 6.0 | 1.0 | 1.0 | 3.0 | 3.0 | 0.0 | 0.0 | 2.1 |
| Semantics | Concept | 6.0 | 6.0 | 5.0 | 4.0 | 4.0 | 2.0 | 1.0 | 5.0 | 4.1 |
| Semantics | Negation | 22.9 | 20.8 | 25.0 | 16.7 | 17.7 | 15.6 | 16.7 | 17.7 | 19.1 |
| Discourse | Markers | 3.0 | 6.1 | 4.0 | 1.0 | 12.1 | 3.0 | 2.0 | 3.0 | 4.3 |
| Discourse | Sentiment | 19.0 | 16.0 | 12.0 | 14.0 | 15.0 | 16.0 | 11.0 | 10.0 | 14.1 |
| Varieties | Casual | 9.0 | 9.0 | 5.0 | 3.0 | 7.0 | 6.0 | 3.0 | 2.0 | 5.5 |
| Varieties | Dialect | 9.8 | 9.8 | 7.8 | 4.9 | 6.9 | 5.9 | 2.9 | 3.9 | 6.5 |
| | Average | 6.5 | 7.6 | 5.3 | 4.2 | 8.0 | 4.8 | 3.6 | 4.4 | 5.6 |

Negation (19.1%) and Sentiment shifts (14.1%) cause most brittleness. GPT-5 shows best robustness (3.6%).

Grade School Math Generation

| Category | Modification | GPT-4o | Claude-3.5 | Llama 3.1 | GPT-5 | DS R1 | Avg |
|---|---|---|---|---|---|---|---|
| Bias | Temporal | 1.0 | 2.0 | 4.0 | 0.0 | 1.0 | 1.6 |
| Bias | Geographical | 5.0 | 5.0 | 7.0 | 1.0 | 2.0 | 4.0 |
| Bias | Length | 4.0 | 2.0 | 2.0 | 0.0 | 2.0 | 2.0 |
| Orthography | Spelling | 1.0 | 0.0 | 2.0 | 1.0 | 0.0 | 0.8 |
| Orthography | Capitalization | 3.0 | 1.0 | 5.0 | 1.0 | 1.0 | 2.2 |
| Orthography | Punctuation | 0.0 | 0.0 | 5.0 | 1.0 | 1.0 | 1.4 |
| Semantics | Concept | 1.0 | 0.0 | 5.0 | 2.0 | 1.0 | 1.8 |
| Semantics | Negation | 15.0 | 15.0 | 18.0 | 7.0 | 9.0 | 12.8 |
| Discourse | Appraisal | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 0.6 |
| Varieties | Style | 4.0 | 3.0 | 6.0 | 4.0 | 3.0 | 4.0 |
| Varieties | Dialect | 2.0 | 4.0 | 6.0 | 1.0 | 2.0 | 3.0 |
| Syntax | Conjunction | 0.0 | 1.0 | 4.0 | 1.0 | 1.0 | 1.4 |
| Syntax | Voice | 2.0 | 2.0 | 6.0 | 1.0 | 2.0 | 2.6 |
| | Average | 2.9 | 2.7 | 5.5 | 1.6 | 2.0 | 2.9 |

Math reasoning is most robust overall (2.9% avg). Negation remains the key challenge (12.8%). GPT-5 shows best performance (1.6%).

Instruction Following Generation

| Category | Modification | GPT-4o | Claude-3.5 | Llama 3.1 | GPT-5 | DS R1 | Avg |
|---|---|---|---|---|---|---|---|
| Bias | Temporal | 11.1 | 4.0 | 12.1 | 5.1 | 12.1 | 8.9 |
| Bias | Geographical | 9.0 | 12.0 | 14.0 | 9.0 | 10.0 | 10.8 |
| Bias | Length | 11.0 | 10.0 | 18.0 | 6.0 | 16.0 | 12.2 |
| Orthography | Capitalization | 8.1 | 5.1 | 10.1 | 9.1 | 13.1 | 9.1 |
| Orthography | Punctuation | 7.1 | 3.0 | 13.1 | 8.1 | 12.1 | 8.7 |
| Orthography | Spelling | 3.1 | 4.1 | 7.2 | 6.2 | 8.2 | 5.8 |
| Syntax | Conjunction | 11.0 | 7.0 | 6.0 | 8.0 | 11.0 | 8.6 |
| Syntax | Voice | 9.0 | 9.0 | 10.0 | 6.0 | 15.0 | 9.8 |
| Semantics | Concept | 9.1 | 7.1 | 11.1 | 7.1 | 7.1 | 8.3 |
| Semantics | Negation | 24.0 | 22.0 | 24.0 | 23.0 | 23.0 | 23.2 |
| Discourse | Appraisal | 7.1 | 4.1 | 11.2 | 7.1 | 6.1 | 7.1 |
| Varieties | Style | 18.0 | 7.0 | 10.0 | 9.0 | 11.0 | 11.0 |
| Varieties | Dialect | 13.0 | 13.0 | 12.0 | 11.0 | 15.0 | 12.8 |
| | Average | 10.8 | 8.3 | 12.2 | 8.8 | 12.3 | 10.5 |

IFEval shows highest overall unrobustness (10.5% avg). Negation (23.2%) and Dialect (12.8%) cause most failures.

Llama Model Scaling GSM Task

How does model size affect robustness? We compare Llama 3.1 at 8B, 70B, and 405B parameters on the GSM math task.

| Category | Modification | Llama 8B | Llama 70B | Llama 405B | Avg |
|---|---|---|---|---|---|
| Bias | Temporal | 8.0 | 5.0 | 4.0 | 5.7 |
| Bias | Geographical | 8.0 | 3.0 | 7.0 | 6.0 |
| Bias | Length | 8.0 | 2.0 | 2.0 | 4.0 |
| Orthography | Spelling | 10.0 | 2.0 | 2.0 | 4.7 |
| Orthography | Capitalization | 11.0 | 2.0 | 5.0 | 6.0 |
| Orthography | Punctuation | 5.0 | 1.0 | 5.0 | 3.7 |
| Semantics | Concept | 12.0 | 1.0 | 5.0 | 6.0 |
| Semantics | Negation | 22.0 | 17.0 | 18.0 | 19.0 |
| Discourse | Appraisal | 7.0 | 5.0 | 1.0 | 4.3 |
| Varieties | Style | 16.0 | 6.0 | 6.0 | 9.3 |
| Varieties | Dialect | 10.0 | 6.0 | 6.0 | 7.3 |
| Syntax | Conjunction | 12.0 | 2.0 | 4.0 | 6.0 |
| Syntax | Voice | 11.0 | 2.0 | 6.0 | 6.3 |
| | Average | 10.8 | 4.2 | 5.5 | 6.8 |

Key finding: Scaling from 8B to 70B dramatically improves robustness (10.8%→4.2%), but scaling further to 405B slightly regresses (4.2%→5.5%). Negation remains challenging at all scales (17-22%). Larger models achieve higher accuracy, but accuracy gains do not always translate into better robustness.


Citation

Cite This Work

@article{otmakhova2025fluke,
  title     = {FLUKE: A Linguistically-Driven and Task-Agnostic
               Framework for Robustness Evaluation},
  author    = {Otmakhova, Yulia and Truong, Hung Thinh and
               Mahendra, Rahmad and Zhai, Zenan and Zhu, Rongxin
               and Beck, Daniel and Lau, Jey Han},
  journal   = {arXiv preprint arXiv:2504.17311},
  year      = {2025}
}