Idioms expose a gap between recognising words and understanding language. Their meanings are often non-compositional: knowing every individual word is not enough to recover what the expression means in context. A system can parse “break a leg” perfectly and still mistake encouragement for injury.

01 When language stops being literal

Humans routinely distinguish literal and figurative readings from context. “She spilled the tea on the table” describes an accident; “she spilled the tea about the meeting” describes disclosure. That distinction remains difficult for natural-language systems because lexical overlap can be almost identical while the meaning changes completely.

The original project started from a practical limitation: many idiom resources are modest in scale, narrow in language coverage or designed around one isolated task. IdiomX instead treats idiom understanding as a sequence of related problems—from recognising figurative usage to retrieving and explaining meaning across languages.

The design question: can one reproducible resource support detection, semantic retrieval, cross-lingual alignment and interpretable meaning retrieval together, instead of each capability being evaluated in isolation?

Figure: IdiomX project cover illustrating multilingual idiom understanding. IdiomX frames figurative-language understanding as a multilingual data, modelling and evaluation problem.

02 Building a multilingual dataset at scale

The public release contains more than 190,000 contextual examples spanning over 12,000 idioms. English expressions are linked to Arabic and French semantic representations, together with idiomatic, literal and borderline usage labels and supporting linguistic metadata.

≈196K: rows in the current full Hugging Face dataset

12K+: unique idioms represented

EN · AR · FR: English, Arabic and French semantic alignment

≈1.04: reported sentence-reuse factor after cleaning

The construction process combines lexical resources, controlled generation and validation. Its modular structure matters as much as its size: the objective is to make every stage inspectable and repeatable rather than publishing an opaque final file.

Collection

Extract candidate idioms from sources including Wiktionary-derived data and WordNet, while extending coverage with modern and generated candidates.

Cleaning and normalisation

Filter noise, standardise expressions, remove duplicates and prepare consistent records for enrichment and evaluation.

Controlled LLM enrichment

Use GPT-4.1-mini to generate meanings, contextual examples and aligned English, Arabic and French semantic fields.

Structured validation

Combine semantic-similarity scoring, rule-based checks, deduplication and leakage-aware splits to support reliable benchmarking.
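
To make the cleaning and splitting stages concrete, here is a minimal sketch in plain Python. It is not the project's pipeline code: the normalisation rules, the row fields (`idiom`, `sentence`), the helper names and the choice to group splits by idiom are all assumptions made for illustration. The idea it shows is that if every example of an idiom is sent to the same split, a model cannot score well on the test set simply by having memorised that idiom during training.

```python
import hashlib
import re
import unicodedata


def normalise_idiom(text: str) -> str:
    """Return a canonical key for an idiom string (illustrative rules only)."""
    # Fold compatibility characters and unify curly apostrophes.
    text = unicodedata.normalize("NFKC", text).replace("\u2019", "'")
    # Lowercase, drop surrounding punctuation, and collapse inner whitespace.
    text = text.lower().strip(" \t\n.,;:!?\"'")
    return re.sub(r"\s+", " ", text)


def assign_split(idiom: str, test_fraction: float = 0.2) -> str:
    """Map an idiom to 'train' or 'test' deterministically from its hash."""
    digest = hashlib.sha256(idiom.encode("utf-8")).hexdigest()
    # Interpret the first 8 hex digits as a number in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return "test" if bucket < test_fraction else "train"


# Toy rows, invented for this example (not taken from the dataset).
rows = [
    {"idiom": "Spill the tea", "sentence": "She spilled the tea about the meeting."},
    {"idiom": "spill the tea.", "sentence": "She spilled the tea about the meeting."},
    {"idiom": "spill the tea", "sentence": "She spilled the tea on the table."},
    {"idiom": "Break a leg!", "sentence": "Break a leg tonight!"},
]

# Cleaning: normalise the idiom and drop repeated (idiom, sentence) pairs.
seen = set()
clean_rows = []
for row in rows:
    key = (normalise_idiom(row["idiom"]), row["sentence"].strip())
    if key not in seen:
        seen.add(key)
        clean_rows.append({"idiom": key[0], "sentence": key[1]})

# Splitting: every example of one idiom lands in the same split.
splits = {"train": [], "test": []}
for row in clean_rows:
    splits[assign_split(row["idiom"])].append(row)

train_idioms = {row["idiom"] for row in splits["train"]}
test_idioms = {row["idiom"] for row in splits["test"]}
print(len(clean_rows), train_idioms.isdisjoint(test_idioms))  # 3 True
```

Hashing the idiom rather than shuffling rows keeps the assignment stable across runs and across dataset updates, which suits a pipeline meant to be repeatable.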

Figure: IdiomX data preparation pipeline. The data-preparation workflow moves from heterogeneous lexical sources to normalised, enriched and validated examples.

03 More than a dataset: one progressive benchmark

IdiomX is organised as a progression. The first task asks whether a model recognises figurative usage. The later tasks ask whether it can retrieve an appropriate idiom from context, align meaning across languages and return an explanation that a person can inspect.

Figure: Full IdiomX dataset and benchmark pipeline. The complete workflow joins dataset construction, model training, retrieval benchmarking, multilingual interpretation and deployment-ready artefacts.

04 Four tasks, from recognition to interpretation

Task 1

Idiom detection

Determine whether an expression is being used idiomatically or literally within its sentence.

Compared: TF-IDF with Logistic Regression, DistilBERT and RoBERTa
Reported best: RoBERTa
Capability: Contextual disambiguation

Task 2

Context-to-idiom retrieval

Given a contextual sentence, rank the idioms that best express its underlying figurative meaning.

Compared: Dense, lexical and hybrid retrieval with reranking
Reported best: Hybrid retrieval with a fine-tuned reranker
Capability: Semantic retrieval
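
One common way to combine a lexical ranking with a dense ranking is reciprocal rank fusion (RRF). The sketch below uses it only as an illustration of hybrid retrieval: the article does not say which fusion method IdiomX uses, the two input rankings are hypothetical retriever outputs, and the reranking stage is left out entirely.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked candidate lists into one, best candidate first."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, candidate in enumerate(ranking, start=1):
            # Candidates near the top of any list receive a larger contribution.
            scores[candidate] += 1.0 / (k + rank)
    # Sort by fused score (descending); break ties alphabetically for determinism.
    return sorted(scores, key=lambda c: (-scores[c], c))


# Hypothetical retriever outputs for the context
# "She spilled the tea about the meeting."
lexical_ranking = ["spill the tea", "storm in a teacup", "spill the beans"]
dense_ranking = ["spill the beans", "spill the tea", "let the cat out of the bag"]

fused = reciprocal_rank_fusion([lexical_ranking, dense_ranking])
print(fused[0])  # spill the tea
```

An idiom ranked highly by both retrievers beats one that only a single retriever favours, which is the intuition behind pairing surface-form matching with semantic similarity. A reranker would then rescore the top of this fused list.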

Task 3

Arabic-to-English retrieval

Use an Arabic context to retrieve the corresponding English idiom, testing semantic alignment across languages.

Compared: Multilingual MiniLM, multilingual E5 and fine-tuned E5
Reported best: Fine-tuned E5
Capability: Cross-lingual alignment

Task 4

Idiom interpretation

Retrieve the canonical idiom and explain its meaning in English, Arabic and French.

Compared: Dense and hybrid retrieval, with and without reranking
Reported best: Hybrid retrieval with reranking
Capability: Explainable semantic grounding

05 What the benchmark experiments found

The project reports that contextual transformers substantially improve idiom detection, while hybrid lexical–dense retrieval outperforms dense retrieval alone. Fine-tuning is particularly important for the Arabic-to-English task, where surface forms provide little direct lexical help.

Task	Reported leading configuration	Main result
Detection	RoBERTa	92.6% accuracy · F1 0.926
Context → idiom	Hybrid retrieval + fine-tuned reranker	Top-1 88.5%
Arabic → English idiom	Fine-tuned E5	Top-1 57.8%
Interpretation	Hybrid retrieval + reranker	Top-1 67.4%

The figures are not interchangeable: each task tests a different search space and difficulty. Their combined value is the progression from classification toward multilingual retrieval and interpretable output.
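
For readers unfamiliar with the retrieval metric, Top-1 is usually read as the share of queries whose correct idiom is ranked first. The sketch below computes that, assuming this standard definition matches the project's evaluation; the function name and the toy rankings are invented for illustration and are not project data.

```python
def top_k_accuracy(ranked_predictions: list[list[str]], gold: list[str], k: int = 1) -> float:
    """Fraction of queries whose gold idiom appears in the first k predictions."""
    if len(ranked_predictions) != len(gold):
        raise ValueError("predictions and gold labels must have the same length")
    if not gold:
        return 0.0
    hits = sum(
        1 for preds, answer in zip(ranked_predictions, gold) if answer in preds[:k]
    )
    return hits / len(gold)


# Toy rankings for three queries (invented for illustration).
predictions = [
    ["spill the tea", "spill the beans"],
    ["bite the bullet", "break a leg"],
    ["piece of cake", "under the weather"],
]
gold = ["spill the tea", "break a leg", "under the weather"]

print(top_k_accuracy(predictions, gold, k=1))  # one of three queries is correct at rank 1
print(top_k_accuracy(predictions, gold, k=2))  # 1.0
```

Because the score depends on how many candidate idioms each query is ranked against, a Top-1 figure on one task says little about another task with a different candidate pool.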

A Task 4 output is designed to be readable

Input: Spill the tea

Canonical meaning: Reveal gossip or personal secrets

Multilingual output: English, Arabic and French explanations

06 Why figurative-language understanding matters

Idioms are not edge cases confined to dictionaries. They appear in conversation, support requests, social media, subtitles, teaching material and everyday instructions. Systems that interpret them literally can misunderstand intent even when every individual token is familiar.

Conversational AI: Chatbots and assistants that better recognise what users actually mean.

Translation: Systems that seek an equivalent meaning rather than translating word by word.

Language learning: Tools that retrieve explanations, contexts and cross-language equivalents.

Semantic search: Retrieval based on intended meaning rather than surface-form overlap.

Content analysis: Improved treatment of slang, figurative language and context-dependent usage.

Human–robot interaction: Interfaces that cope more reliably with natural, culturally situated speech.

07 The project is open: data, code, paper and demonstrations

IdiomX separates the dataset-construction pipeline from the modelling and benchmark repository. This makes the provenance of the resource easier to inspect while keeping the task notebooks, trained artefacts and demonstrations organised around evaluation.

Ayman Ali Sharara

DSTI student in the MSc in Data Science & AI, studying in the online asynchronous format. His work spans multilingual NLP, data engineering, retrieval systems and practical AI applications. IdiomX was developed as his Deep Learning with Python project.

Article adapted for the DSTI TechBlog from Ayman Sharara’s original student-project contribution and the project’s current public documentation. The writing and presentation have been revised while preserving the project’s methods, claims and reported results.