¹The University of Melbourne · ²RMIT University · ³Oracle
FLUKE introduces controlled variations across linguistic levels—from orthography to dialect—to systematically evaluate model robustness through minimal modifications of test data.
We present FLUKE, a task-agnostic framework for assessing model robustness through systematic minimal variations of test data. FLUKE introduces controlled variations across linguistic levels—from orthography to dialect and style—and leverages large language models with human validation to generate modifications.
We evaluate both fine-tuned models and LLMs across six diverse NLP tasks, revealing that: (1) the impact of linguistic variations is highly task-dependent; (2) LLMs still exhibit significant brittleness, with reasoning LLMs showing less robustness on some tasks; (3) natural modifications hurt more than corruption-style tests; (4) generation ability does not correlate with robustness.
FLUKE applies controlled linguistic modifications that preserve task labels while testing model robustness.
Apply FLUKE modifications using your OpenAI API key. Your key is used directly in-browser and never stored.
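FLUKE itself generates modifications with LLMs plus human validation, but the corruption-style tests are easy to script locally. Below is a minimal sketch of two orthography modifications (adjacent-letter swap and capitalization change); the function names are illustrative, not from the paper.

```python
import random

def swap_adjacent_chars(text: str, seed: int = 0) -> str:
    """Corruption-style spelling test: swap two adjacent interior
    letters of one randomly chosen word longer than three characters."""
    rng = random.Random(seed)
    words = text.split()
    candidates = [i for i, w in enumerate(words) if len(w) > 3 and w.isalpha()]
    if not candidates:
        return text
    wi = rng.choice(candidates)
    w = words[wi]
    ci = rng.randrange(1, len(w) - 2)  # keep first and last letters intact
    chars = list(w)
    chars[ci], chars[ci + 1] = chars[ci + 1], chars[ci]
    words[wi] = "".join(chars)
    return " ".join(words)

def uppercase_entity(text: str, entity: str) -> str:
    """Capitalization test: upper-case a single target span."""
    return text.replace(entity, entity.upper(), 1)
```

The natural modifications (voice, style, dialect) that FLUKE finds most damaging cannot be scripted this way; that is where the LLM-with-human-validation pipeline comes in.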
Some tests are critical for certain tasks but irrelevant for others. Negation devastates Dialogue (30%) but barely affects NER (7.4%).
Reasoning LLMs (GPT-5, DeepSeek R1) sometimes show less robustness than base models on classification tasks.
Fluent modifications (syntax, style, dialect) cause more brittleness than corruption tests like letter flipping.
A model's ability to use a linguistic feature in generation doesn't predict its robustness to that feature.
We evaluate across classification and generation tasks using PLMs, base LLMs, and reasoning LLMs.
KnowRef dataset (coreference resolution)
DECODE dataset (dialogue contradiction detection)
Few-NERD dataset (named entity recognition)
SST-2 dataset (sentiment analysis)
GSM8K dataset (math reasoning)
IFEval benchmark (instruction following)
Unrobustness (U%) measures prediction changes between original and modified instances. Higher = more brittle. Select a task to view detailed results.
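Given aligned predictions on original and modified instances, U% is simply the fraction that change. A minimal sketch (the function name is mine, not the paper's):

```python
def unrobustness(original_preds, modified_preds):
    """U%: share of instances whose prediction changes after the
    linguistic modification is applied. Higher = more brittle."""
    if len(original_preds) != len(modified_preds):
        raise ValueError("prediction lists must be aligned")
    changed = sum(o != m for o, m in zip(original_preds, modified_preds))
    return 100.0 * changed / len(original_preds)

# Example: 2 of 5 predictions flip under a modification -> U% = 40.0
u = unrobustness(["pos", "neg", "pos", "pos", "neg"],
                 ["pos", "pos", "pos", "neg", "neg"])
```

Note that U% counts any change, so on classification tasks it captures flips in both directions, not only accuracy drops.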
| Category | Modification | BERT | GPT-2 | T5 | GPT-4o | Claude | Llama | GPT-5 | DS R1 | Avg |
|---|---|---|---|---|---|---|---|---|---|---|
| Bias | Temporal | 9.0 | 4.0 | 6.0 | 8.0 | 1.0 | 7.0 | 3.0 | 8.0 | 5.8 |
| | Geographical | 8.0 | 14.0 | 9.0 | 10.0 | 4.0 | 5.0 | 6.0 | 10.0 | 8.2 |
| | Length | 19.2 | 15.2 | 14.1 | 15.2 | 20.2 | 17.2 | 8.1 | 13.1 | 15.3 |
| Orthography | Spelling | 11.2 | 3.1 | 7.1 | 5.1 | 2.0 | 4.1 | 7.1 | 4.1 | 5.5 |
| | Capitalization | 15.2 | 14.1 | 7.1 | 7.1 | 4.0 | 5.1 | 2.0 | 6.1 | 7.6 |
| | Punctuation | 1.0 | 7.1 | 1.0 | 1.0 | 3.0 | 1.0 | 2.0 | 4.0 | 2.5 |
| Morphology | Derivation | 4.1 | 3.1 | 5.1 | 4.1 | 1.0 | 1.0 | 5.1 | 2.0 | 3.2 |
| | Compound | 6.2 | 4.2 | 3.1 | 5.2 | 6.2 | 6.2 | 3.1 | 4.2 | 4.8 |
| Syntax | Voice | 35.8 | 34.7 | 41.1 | 26.3 | 15.8 | 30.5 | 9.5 | 15.8 | 26.2 |
| | Grammar | 30.6 | 27.8 | 19.4 | 22.2 | 18.1 | 19.4 | 16.7 | 20.8 | 21.9 |
| | Conjunction | 5.2 | 10.3 | 8.2 | 8.2 | 4.1 | 7.2 | 5.2 | 6.2 | 6.8 |
| Semantics | Concept | 8.0 | 3.0 | 5.0 | 11.0 | 18.0 | 9.0 | 10.0 | 10.0 | 9.2 |
| | Negation | 25.5 | 23.5 | 24.5 | 21.4 | 24.5 | 22.4 | 22.4 | 30.6 | 24.4 |
| Discourse | Markers | 8.0 | 6.0 | 8.0 | 19.0 | 10.0 | 7.0 | 9.0 | 11.0 | 9.8 |
| | Appraisal | 5.0 | 8.0 | 4.0 | 8.0 | 7.0 | 9.0 | 5.0 | 7.0 | 6.6 |
| Varieties | Style | 16.0 | 19.0 | 18.0 | 19.0 | 14.0 | 14.0 | 14.0 | 12.0 | 15.8 |
| | Dialect | 8.3 | 26.9 | 22.2 | 24.5 | 11.8 | 16.7 | 2.9 | 12.7 | 15.8 |
| | Average | 12.7 | 13.2 | 11.9 | 12.7 | 9.7 | 10.7 | 7.7 | 10.5 | 11.1 |
Voice and Grammar modifications cause the highest unrobustness (26.2% and 21.9% avg). GPT-5 shows the best overall robustness (7.7% avg).
| Category | Modification | BERT | GPT-2 | T5 | GPT-4o | Claude | Llama | GPT-5 | DS R1 | Avg |
|---|---|---|---|---|---|---|---|---|---|---|
| Bias | Temporal | 8.0 | 6.0 | 7.0 | 2.0 | 5.0 | 1.0 | 3.0 | 2.0 | 4.2 |
| | Geographical | 6.5 | 8.7 | 12.0 | 13.0 | 15.2 | 12.0 | 18.5 | 13.0 | 12.4 |
| | Length | 10.0 | 6.0 | 5.0 | 4.0 | 6.0 | 1.0 | 6.0 | 7.0 | 5.6 |
| Orthography | Spelling | 6.0 | 6.0 | 4.0 | 8.0 | 3.0 | 2.0 | 1.0 | 6.0 | 4.5 |
| | Capitalization | 0.0 | 6.2 | 6.2 | 3.1 | 2.1 | 2.1 | 4.2 | 1.0 | 3.1 |
| | Punctuation | 4.0 | 2.0 | 8.0 | 3.0 | 3.0 | 1.0 | 5.0 | 2.0 | 3.5 |
| Morphology | Derivation | 0.0 | 2.2 | 1.1 | 4.3 | 5.4 | 1.1 | 5.4 | 6.5 | 3.2 |
| | Compound | 6.0 | 4.0 | 3.0 | 4.0 | 7.0 | 1.0 | 8.0 | 6.0 | 4.9 |
| Syntax | Voice | 9.0 | 11.0 | 7.0 | 7.0 | 2.0 | 3.0 | 9.0 | 5.0 | 6.6 |
| | Grammar | 14.3 | 12.9 | 7.1 | 20.6 | 7.4 | 11.8 | 16.2 | 19.1 | 13.7 |
| | Conjunction | 5.0 | 4.0 | 7.0 | 6.0 | 1.0 | 2.0 | 4.0 | 4.0 | 4.1 |
| Semantics | Concept | 7.0 | 5.0 | 4.0 | 8.0 | 7.0 | 2.0 | 10.0 | 6.0 | 6.1 |
| | Negation | 29.0 | 39.0 | 31.0 | 31.0 | 30.0 | 30.0 | 25.0 | 25.0 | 30.0 |
| Discourse | Markers | 5.7 | 9.2 | 3.4 | 3.4 | 2.3 | 3.4 | 5.7 | 5.7 | 4.9 |
| | Appraisal | 6.0 | 10.0 | 5.0 | 7.0 | 6.0 | 4.0 | 10.0 | 11.0 | 7.4 |
| Varieties | Style | 12.0 | 10.0 | 12.0 | 6.0 | 4.0 | 2.0 | 6.0 | 6.0 | 7.2 |
| | Dialect | 20.8 | 5.7 | 9.4 | 11.3 | 15.1 | 7.5 | 5.7 | 13.2 | 11.1 |
| | Average | 8.8 | 8.7 | 7.8 | 8.3 | 7.1 | 5.1 | 8.4 | 8.2 | 7.8 |
Negation causes severe brittleness (30.0% avg). Llama 3.1 shows the best robustness on dialogue (5.1% avg).
| Category | Modification | BERT | GPT-2 | T5 | GPT-4o | Claude | Llama | GPT-5 | DS R1 | Avg |
|---|---|---|---|---|---|---|---|---|---|---|
| Bias | Temporal | 3.7 | 3.0 | 1.2 | 6.2 | 4.3 | 1.8 | 7.7 | 10.6 | 4.8 |
| | Geographical | 25.1 | 27.6 | 29.0 | 22.3 | 26.0 | 27.0 | 22.0 | 27.8 | 25.8 |
| | Length | 11.8 | 12.2 | 13.1 | 12.7 | 15.7 | 6.4 | 9.7 | 12.6 | 11.8 |
| Orthography | Spelling | 4.3 | 2.5 | 1.4 | 6.6 | 3.0 | 3.9 | 7.3 | 11.4 | 5.0 |
| | Capitalization | 0.9 | 19.3 | 13.1 | 11.3 | 8.9 | 15.3 | 9.1 | 13.6 | 11.5 |
| | Punctuation | 6.5 | 3.7 | 4.9 | 7.1 | 9.1 | 7.9 | 7.8 | 15.0 | 7.8 |
| Morphology | Derivation | 1.9 | 4.2 | 5.8 | 3.7 | 2.0 | 1.3 | 9.5 | 8.2 | 4.6 |
| | Compound | 3.1 | 0.6 | 5.2 | 3.1 | 1.7 | 1.0 | 7.0 | 12.9 | 4.3 |
| Syntax | Voice | 7.8 | 10.8 | 5.7 | 7.5 | 5.5 | 4.5 | 8.3 | 11.3 | 7.7 |
| | Grammar | 8.0 | 15.9 | 10.5 | 8.3 | 3.5 | 5.3 | 10.8 | 12.4 | 9.3 |
| | Conjunction | 9.1 | 7.6 | 7.7 | 9.4 | 7.5 | 8.0 | 11.7 | 13.2 | 9.3 |
| Semantics | Concept | 5.0 | 8.9 | 5.3 | 6.5 | 4.8 | 3.9 | 8.5 | 9.4 | 6.5 |
| | Negation | 4.8 | 5.2 | 6.5 | 8.2 | 3.9 | 4.6 | 10.5 | 15.2 | 7.4 |
| Discourse | Markers | 4.7 | 1.6 | 5.4 | 2.2 | 1.9 | 3.2 | 6.3 | 8.7 | 4.2 |
| | Appraisal | 5.5 | 2.6 | 4.8 | 3.2 | 3.1 | 2.9 | 7.1 | 14.0 | 5.4 |
| Varieties | Style | 11.8 | 12.2 | 13.1 | 3.6 | 4.8 | 6.4 | 7.4 | 10.9 | 8.8 |
| | Dialect | 12.6 | 7.4 | 8.4 | 5.9 | 8.9 | 4.7 | 7.0 | 13.2 | 8.5 |
| | Average | 7.4 | 8.6 | 8.3 | 7.5 | 6.7 | 6.4 | 9.3 | 13.0 | 8.4 |
Geographical bias causes the highest NER brittleness (25.8% avg). DeepSeek R1 shows the highest unrobustness (13.0% avg).
| Category | Modification | BERT | GPT-2 | T5 | GPT-4o | Claude | Llama | GPT-5 | DS R1 | Avg |
|---|---|---|---|---|---|---|---|---|---|---|
| Bias | Temporal | 5.0 | 5.0 | 3.0 | 0.0 | 8.0 | 1.0 | 3.0 | 3.0 | 3.5 |
| | Geographical | 5.0 | 4.0 | 3.0 | 9.0 | 8.0 | 7.0 | 4.0 | 8.0 | 6.0 |
| | Length | 3.0 | 2.0 | 3.0 | 3.0 | 9.0 | 2.0 | 1.0 | 1.0 | 3.0 |
| Orthography | Spelling | 4.0 | 4.0 | 0.0 | 0.0 | 6.0 | 3.0 | 0.0 | 0.0 | 2.1 |
| | Capitalization | 0.0 | 3.0 | 3.0 | 1.0 | 7.1 | 3.0 | 0.0 | 3.0 | 2.5 |
| | Punctuation | 1.0 | 2.0 | 0.0 | 1.0 | 3.0 | 3.0 | 1.0 | 1.0 | 1.5 |
| Morphology | Derivation | 3.4 | 5.7 | 2.3 | 2.3 | 6.9 | 2.3 | 3.4 | 4.6 | 3.9 |
| | Compound | 3.2 | 8.4 | 3.2 | 2.1 | 5.3 | 1.1 | 3.2 | 4.2 | 3.8 |
| Syntax | Voice | 9.0 | 10.0 | 9.0 | 4.0 | 10.0 | 3.0 | 5.0 | 4.0 | 6.8 |
| | Grammar | 4.5 | 12.1 | 3.0 | 4.5 | 7.6 | 4.5 | 4.5 | 4.5 | 5.7 |
| | Conjunction | 3.0 | 6.0 | 1.0 | 1.0 | 3.0 | 3.0 | 0.0 | 0.0 | 2.1 |
| Semantics | Concept | 6.0 | 6.0 | 5.0 | 4.0 | 4.0 | 2.0 | 1.0 | 5.0 | 4.1 |
| | Negation | 22.9 | 20.8 | 25.0 | 16.7 | 17.7 | 15.6 | 16.7 | 17.7 | 19.1 |
| Discourse | Markers | 3.0 | 6.1 | 4.0 | 1.0 | 12.1 | 3.0 | 2.0 | 3.0 | 4.3 |
| | Sentiment | 19.0 | 16.0 | 12.0 | 14.0 | 15.0 | 16.0 | 11.0 | 10.0 | 14.1 |
| Varieties | Casual | 9.0 | 9.0 | 5.0 | 3.0 | 7.0 | 6.0 | 3.0 | 2.0 | 5.5 |
| | Dialect | 9.8 | 9.8 | 7.8 | 4.9 | 6.9 | 5.9 | 2.9 | 3.9 | 6.5 |
| | Average | 6.5 | 7.6 | 5.3 | 4.2 | 8.0 | 4.8 | 3.6 | 4.4 | 5.6 |
Negation (19.1%) and sentiment shifts (14.1%) cause the most brittleness. GPT-5 shows the best robustness (3.6% avg).
| Category | Modification | GPT-4o | Claude-3.5 | Llama 3.1 | GPT-5 | DS R1 | Avg |
|---|---|---|---|---|---|---|---|
| Bias | Temporal | 1.0 | 2.0 | 4.0 | 0.0 | 1.0 | 1.6 |
| | Geographical | 5.0 | 5.0 | 7.0 | 1.0 | 2.0 | 4.0 |
| | Length | 4.0 | 2.0 | 2.0 | 0.0 | 2.0 | 2.0 |
| Orthography | Spelling | 1.0 | 0.0 | 2.0 | 1.0 | 0.0 | 0.8 |
| | Capitalization | 3.0 | 1.0 | 5.0 | 1.0 | 1.0 | 2.2 |
| | Punctuation | 0.0 | 0.0 | 5.0 | 1.0 | 1.0 | 1.4 |
| Semantics | Concept | 1.0 | 0.0 | 5.0 | 2.0 | 1.0 | 1.8 |
| | Negation | 15.0 | 15.0 | 18.0 | 7.0 | 9.0 | 12.8 |
| Discourse | Appraisal | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 0.6 |
| Varieties | Style | 4.0 | 3.0 | 6.0 | 4.0 | 3.0 | 4.0 |
| | Dialect | 2.0 | 4.0 | 6.0 | 1.0 | 2.0 | 3.0 |
| Syntax | Conjunction | 0.0 | 1.0 | 4.0 | 1.0 | 1.0 | 1.4 |
| | Voice | 2.0 | 2.0 | 6.0 | 1.0 | 2.0 | 2.6 |
| | Average | 2.9 | 2.7 | 5.5 | 1.6 | 2.0 | 2.9 |
Math reasoning is the most robust task overall (2.9% avg). Negation remains the key challenge (12.8%). GPT-5 performs best (1.6% avg).
| Category | Modification | GPT-4o | Claude-3.5 | Llama 3.1 | GPT-5 | DS R1 | Avg |
|---|---|---|---|---|---|---|---|
| Bias | Temporal | 11.1 | 4.0 | 12.1 | 5.1 | 12.1 | 8.9 |
| | Geographical | 9.0 | 12.0 | 14.0 | 9.0 | 10.0 | 10.8 |
| | Length | 11.0 | 10.0 | 18.0 | 6.0 | 16.0 | 12.2 |
| Orthography | Capitalization | 8.1 | 5.1 | 10.1 | 9.1 | 13.1 | 9.1 |
| | Punctuation | 7.1 | 3.0 | 13.1 | 8.1 | 12.1 | 8.7 |
| | Spelling | 3.1 | 4.1 | 7.2 | 6.2 | 8.2 | 5.8 |
| Syntax | Conjunction | 11.0 | 7.0 | 6.0 | 8.0 | 11.0 | 8.6 |
| | Voice | 9.0 | 9.0 | 10.0 | 6.0 | 15.0 | 9.8 |
| Semantics | Concept | 9.1 | 7.1 | 11.1 | 7.1 | 7.1 | 8.3 |
| | Negation | 24.0 | 22.0 | 24.0 | 23.0 | 23.0 | 23.2 |
| Discourse | Appraisal | 7.1 | 4.1 | 11.2 | 7.1 | 6.1 | 7.1 |
| Varieties | Style | 18.0 | 7.0 | 10.0 | 9.0 | 11.0 | 11.0 |
| | Dialect | 13.0 | 13.0 | 12.0 | 11.0 | 15.0 | 12.8 |
| | Average | 10.8 | 8.3 | 12.2 | 8.8 | 12.3 | 10.5 |
IFEval shows the highest overall unrobustness (10.5% avg). Negation (23.2%) and Dialect (12.8%) cause the most failures.
How does model size affect robustness? We compare Llama 3.1 at 8B, 70B, and 405B parameters on the GSM math task.
| Category | Modification | Llama 8B | Llama 70B | Llama 405B | Avg |
|---|---|---|---|---|---|
| Bias | Temporal | 8.0 | 5.0 | 4.0 | 5.7 |
| | Geographical | 8.0 | 3.0 | 7.0 | 6.0 |
| | Length | 8.0 | 2.0 | 2.0 | 4.0 |
| Orthography | Spelling | 10.0 | 2.0 | 2.0 | 4.7 |
| | Capitalization | 11.0 | 2.0 | 5.0 | 6.0 |
| | Punctuation | 5.0 | 1.0 | 5.0 | 3.7 |
| Semantics | Concept | 12.0 | 1.0 | 5.0 | 6.0 |
| | Negation | 22.0 | 17.0 | 18.0 | 19.0 |
| Discourse | Appraisal | 7.0 | 5.0 | 1.0 | 4.3 |
| Varieties | Style | 16.0 | 6.0 | 6.0 | 9.3 |
| | Dialect | 10.0 | 6.0 | 6.0 | 7.3 |
| Syntax | Conjunction | 12.0 | 2.0 | 4.0 | 6.0 |
| | Voice | 11.0 | 2.0 | 6.0 | 6.3 |
| | Average | 10.8 | 4.2 | 5.5 | 6.8 |
Key finding: scaling from 8B to 70B dramatically improves robustness (10.8% → 4.2%), but scaling from 70B to 405B brings no further gain (4.2% → 5.5%). Negation remains challenging at all scales (17-22%). Larger models achieve higher accuracy, but this does not always translate into better robustness.
Color legend: High ≥ 15% · Medium 10-15% · Low ≤ 2%
@article{otmakhova2025fluke,
title = {FLUKE: A Linguistically-Driven and Task-Agnostic
Framework for Robustness Evaluation},
author = {Otmakhova, Yulia and Truong, Hung Thinh and
Mahendra, Rahmad and Zhai, Zenan and Zhu, Rongxin
and Beck, Daniel and Lau, Jey Han},
journal = {arXiv preprint arXiv:2504.17311},
year = {2025}
}