Idioms expose a gap between recognising words and understanding language. Their meanings are often non-compositional: knowing every individual word is not enough to recover what the expression means in context. A system can parse “break a leg” perfectly and still mistake encouragement for injury.
01When language stops being literal
Humans routinely distinguish literal and figurative readings from context. “She spilled the tea on the table” describes an accident; “she spilled the tea about the meeting” describes disclosure. That distinction remains difficult for natural-language systems because lexical overlap can be almost identical while the meaning changes completely.
The original project started from a practical limitation: many idiom resources are modest in scale, narrow in language coverage or designed around one isolated task. IdiomX instead treats idiom understanding as a sequence of related problems—from recognising figurative usage to retrieving and explaining meaning across languages.

02Building a multilingual dataset at scale
The public release contains more than 190,000 contextual examples spanning over 12,000 idioms. English expressions are linked to Arabic and French semantic representations, together with idiomatic, literal and borderline usage labels and supporting linguistic metadata.
The construction process combines lexical resources, controlled generation and validation. Its modular structure matters as much as its size: the objective is to make every stage inspectable and repeatable rather than publishing an opaque final file.
Collection
Extract candidate idioms from sources including Wiktionary-derived data and WordNet, while extending coverage with modern and generated candidates.
Cleaning and normalisation
Filter noise, standardise expressions, remove duplicates and prepare consistent records for enrichment and evaluation.
Controlled LLM enrichment
Use GPT-4.1-mini to generate meanings, contextual examples and aligned English, Arabic and French semantic fields.
Structured validation
Combine semantic-similarity scoring, rule-based checks, deduplication and leakage-aware splits to support reliable benchmarking.

03More than a dataset: one progressive benchmark
IdiomX is organised as a progression. The first task asks whether a model recognises figurative usage. The later tasks ask whether it can retrieve an appropriate idiom from context, align meaning across languages and return an explanation that a person can inspect.

04Four tasks, from recognition to interpretation
Idiom detection
Determine whether an expression is being used idiomatically or literally within its sentence.
- Compared
- TF-IDF with Logistic Regression, DistilBERT and RoBERTa
- Reported best
- RoBERTa
- Capability
- Contextual disambiguation
Context-to-idiom retrieval
Given a contextual sentence, rank the idioms that best express its underlying figurative meaning.
- Compared
- Dense, lexical and hybrid retrieval with reranking
- Reported best
- Hybrid retrieval with a fine-tuned reranker
- Capability
- Semantic retrieval
Arabic-to-English retrieval
Use an Arabic context to retrieve the corresponding English idiom, testing semantic alignment across languages.
- Compared
- Multilingual MiniLM, multilingual E5 and fine-tuned E5
- Reported best
- Fine-tuned E5
- Capability
- Cross-lingual alignment
Idiom interpretation
Retrieve the canonical idiom and explain its meaning in English, Arabic and French.
- Compared
- Dense and hybrid retrieval, with and without reranking
- Reported best
- Hybrid retrieval with reranking
- Capability
- Explainable semantic grounding
05What the benchmark experiments found
The project reports that contextual transformers substantially improve idiom detection, while hybrid lexical–dense retrieval outperforms dense retrieval alone. Fine-tuning is particularly important for the Arabic-to-English task, where surface forms provide little direct lexical help.
| Task | Reported leading configuration | Main result |
|---|---|---|
| Detection | RoBERTa | 92.6% accuracy · F1 0.926 |
| Context → idiom | Hybrid retrieval + fine-tuned reranker | Top-1 88.5% |
| Arabic → English idiom | Fine-tuned E5 | Top-1 57.8% |
| Interpretation | Hybrid retrieval + reranker | Top-1 67.4% |
The figures are not interchangeable: each task tests a different search space and difficulty. Their combined value is the progression from classification toward multilingual retrieval and interpretable output.
A Task 4 output is designed to be readable
06Why figurative-language understanding matters
Idioms are not edge cases confined to dictionaries. They appear in conversation, support requests, social media, subtitles, teaching material and everyday instructions. Systems that interpret them literally can misunderstand intent even when every individual token is familiar.
07The project is open: data, code, paper and demonstrations
IdiomX separates the dataset-construction pipeline from the modelling and benchmark repository. This makes the provenance of the resource easier to inspect while keeping the task notebooks, trained artefacts and demonstrations organised around evaluation.
Ayman Ali Sharara
DSTI student in the MSc in Data Science & AI, studying through Online asynchronous. His work spans multilingual NLP, data engineering, retrieval systems and practical AI applications. IdiomX was developed as his Deep Learning with Python project.
Article adapted for the DSTI TechBlog from Ayman Sharara’s original student-project contribution and the project’s current public documentation. The writing and presentation have been revised while preserving the project’s methods, claims and reported results.