
Building IdiomX: language beyond the literal

“Spill the tea” is not an instruction involving a drink. DSTI student Ayman Ali Sharara built IdiomX to test whether multilingual AI systems can detect idiomatic usage, retrieve figurative expressions and explain meaning across English, Arabic and French.

Tags: idiomx, multilingual-nlp, figurative-language, semantic-retrieval, english-arabic-french, student-project

Idioms expose a gap between recognising words and understanding language. Their meanings are often non-compositional: knowing every individual word is not enough to recover what the expression means in context. A system can parse “break a leg” perfectly and still mistake encouragement for injury.

01 When language stops being literal

Humans routinely distinguish literal and figurative readings from context. “She spilled the tea on the table” describes an accident; “she spilled the tea about the meeting” describes disclosure. That distinction remains difficult for natural-language systems because lexical overlap can be almost identical while the meaning changes completely.
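
The overlap in that pair can be made concrete with a small sketch. Jaccard overlap over word types is used here purely for illustration; it is an assumption of this example, not a measure the project reports.

```python
import re

def tokens(sentence: str) -> set[str]:
    """Lowercase a sentence and return its set of word types."""
    return set(re.findall(r"[a-z']+", sentence.lower()))

def jaccard(a: str, b: str) -> float:
    """Share of word types common to both sentences (0 = disjoint, 1 = identical)."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb)

literal = "She spilled the tea on the table"
figurative = "She spilled the tea about the meeting"

shared = tokens(literal) & tokens(figurative)
overlap = jaccard(literal, figurative)
print(sorted(shared), overlap)
```

Both sentences contain the whole phrase "she spilled the tea"; only the trailing words differ, and nothing in the shared tokens indicates which reading applies.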

The original project started from a practical limitation: many idiom resources are modest in scale, narrow in language coverage or designed around one isolated task. IdiomX instead treats idiom understanding as a sequence of related problems—from recognising figurative usage to retrieving and explaining meaning across languages.

The design question: can one reproducible resource support detection, semantic retrieval, cross-lingual alignment and interpretable meaning retrieval rather than evaluating each capability in isolation?
Figure: IdiomX project cover illustrating multilingual idiom understanding. IdiomX frames figurative-language understanding as a multilingual data, modelling and evaluation problem.

02 Building a multilingual dataset at scale

The public release contains more than 190,000 contextual examples spanning over 12,000 idioms. English expressions are linked to Arabic and French semantic representations, together with idiomatic, literal and borderline usage labels and supporting linguistic metadata.

≈196K: rows in the current full Hugging Face dataset
12K+: unique idioms represented
EN · AR · FR: English, Arabic and French semantic alignment
≈1.04: reported sentence-reuse factor after cleaning
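
To picture what one contextual example might look like, here is a minimal sketch of a record type. The three usage labels come from the description above; every field name is hypothetical and may not match the columns of the published dataset, and the Arabic and French values are placeholders rather than real data.

```python
from dataclasses import dataclass

# Label set taken from the article; every field name below is hypothetical.
LABELS = {"idiomatic", "literal", "borderline"}

@dataclass(frozen=True)
class IdiomExample:
    idiom: str
    sentence: str
    label: str
    meaning_en: str
    meaning_ar: str
    meaning_fr: str

    def __post_init__(self) -> None:
        # Reject any label outside the three usage classes.
        if self.label not in LABELS:
            raise ValueError(f"unknown label: {self.label!r}")

example = IdiomExample(
    idiom="spill the tea",
    sentence="She spilled the tea about the meeting.",
    label="idiomatic",
    meaning_en="Reveal gossip or personal secrets",
    meaning_ar="<Arabic meaning>",   # placeholder, not real data
    meaning_fr="<French meaning>",   # placeholder, not real data
)
```

Validating the label at construction time means a malformed row fails early instead of surfacing later as a silent extra class during training.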

The construction process combines lexical resources, controlled generation and validation. Its modular structure matters as much as its size: the objective is to make every stage inspectable and repeatable rather than publishing an opaque final file.

1. Collection

Extract candidate idioms from sources including Wiktionary-derived data and WordNet, while extending coverage with modern and generated candidates.

2. Cleaning and normalisation

Filter noise, standardise expressions, remove duplicates and prepare consistent records for enrichment and evaluation.

3. Controlled LLM enrichment

Use GPT-4.1-mini to generate meanings, contextual examples and aligned English, Arabic and French semantic fields.

4. Structured validation

Combine semantic-similarity scoring, rule-based checks, deduplication and leakage-aware splits to support reliable benchmarking.

Figure: IdiomX data preparation pipeline. The data-preparation workflow moves from heterogeneous lexical sources to normalised, enriched and validated examples.
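
The cleaning and splitting stages can be sketched in a few lines. This is an illustration under stated assumptions, not the project's pipeline: the normalisation rules are invented, and "leakage-aware" is read here as keeping every sentence for a given idiom on the same side of the train/test boundary, which is one common interpretation.

```python
import hashlib
import re

def normalise(idiom: str) -> str:
    """Lowercase, trim and collapse internal whitespace."""
    return re.sub(r"\s+", " ", idiom.strip().lower())

def deduplicate(rows: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Keep the first occurrence of each (normalised idiom, sentence) pair."""
    seen: set[tuple[str, str]] = set()
    unique = []
    for idiom, sentence in rows:
        key = (normalise(idiom), sentence.strip())
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique

def split_of(idiom: str, test_share: float = 0.2) -> str:
    """Assign an idiom to a split from a stable hash, so every sentence
    for that idiom lands on the same side."""
    digest = hashlib.sha256(normalise(idiom).encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "test" if bucket < test_share else "train"

rows = [
    ("Spill the tea", "She spilled the tea about the meeting."),
    ("spill  the tea ", "She spilled the tea about the meeting."),
    ("Break a leg", "Break a leg tonight, you will be great."),
    ("break a leg", "He told her to break a leg before the show."),
]
clean = deduplicate(rows)

splits: dict[str, list[tuple[str, str]]] = {"train": [], "test": []}
for idiom, sentence in clean:
    splits[split_of(idiom)].append((idiom, sentence))

train_idioms = {idiom for idiom, _ in splits["train"]}
test_idioms = {idiom for idiom, _ in splits["test"]}
```

Hashing the idiom rather than shuffling rows makes the assignment reproducible across runs and guarantees that no idiom seen in training reappears in the test set.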

03 More than a dataset: one progressive benchmark

IdiomX is organised as a progression. The first task asks whether a model recognises figurative usage. The later tasks ask whether it can retrieve an appropriate idiom from context, align meaning across languages and return an explanation that a person can inspect.

Figure: Full IdiomX dataset and benchmark pipeline. The complete workflow joins dataset construction, model training, retrieval benchmarking, multilingual interpretation and deployment-ready artefacts.

04 Four tasks, from recognition to interpretation

Task 1: Idiom detection

Determine whether an expression is being used idiomatically or literally within its sentence.

Compared: TF-IDF with Logistic Regression, DistilBERT and RoBERTa
Reported best: RoBERTa
Capability: Contextual disambiguation

Task 2: Context-to-idiom retrieval

Given a contextual sentence, rank the idioms that best express its underlying figurative meaning.

Compared: Dense, lexical and hybrid retrieval with reranking
Reported best: Hybrid retrieval with a fine-tuned reranker
Capability: Semantic retrieval

Task 3: Arabic-to-English retrieval

Use an Arabic context to retrieve the corresponding English idiom, testing semantic alignment across languages.

Compared: Multilingual MiniLM, multilingual E5 and fine-tuned E5
Reported best: Fine-tuned E5
Capability: Cross-lingual alignment

Task 4: Idiom interpretation

Retrieve the canonical idiom and explain its meaning in English, Arabic and French.

Compared: Dense and hybrid retrieval, with and without reranking
Reported best: Hybrid retrieval with reranking
Capability: Explainable semantic grounding
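
Hybrid retrieval needs some rule for merging a lexical ranking with a dense one. The article does not say which rule IdiomX uses, so the sketch below substitutes reciprocal rank fusion, a widely used option, purely as an illustration. The candidate lists are invented and the reranking stage is not shown.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists: each list adds 1 / (k + rank) to an item's score."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] = scores.get(item, 0.0) + 1.0 / (k + rank)
    # Sort by score, highest first; break ties alphabetically for determinism.
    return sorted(scores, key=lambda item: (-scores[item], item))

# Invented candidate lists for one query, best match first.
lexical = ["spill the beans", "spill the tea", "cry over spilt milk"]
dense = ["spill the tea", "let the cat out of the bag", "spill the beans"]

fused = reciprocal_rank_fusion([lexical, dense])
print(fused)
```

An idiom that both retrievers place near the top rises above one that only a single retriever favours, which is the intuition behind combining surface-form and semantic evidence.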

05 What the benchmark experiments found

The project reports that contextual transformers substantially improve idiom detection, while hybrid lexical–dense retrieval outperforms dense retrieval alone. Fine-tuning is particularly important for the Arabic-to-English task, where surface forms provide little direct lexical help.

| Task | Reported leading configuration | Main result |
| --- | --- | --- |
| Detection | RoBERTa | 92.6% accuracy · F1 0.926 |
| Context → idiom | Hybrid retrieval + fine-tuned reranker | Top-1 88.5% |
| Arabic → English idiom | Fine-tuned E5 | Top-1 57.8% |
| Interpretation | Hybrid retrieval + reranker | Top-1 67.4% |

The figures are not interchangeable: each task tests a different search space and difficulty. Their combined value is the progression from classification toward multilingual retrieval and interpretable output.
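
For readers less familiar with the metrics, here is a minimal sketch of how Top-k accuracy and F1 are typically computed. The toy inputs are invented, and the F1 shown is the binary positive-class variant; the article does not state which averaging its reported F1 uses.

```python
def top_k_accuracy(ranked: list[list[str]], gold: list[str], k: int = 1) -> float:
    """Fraction of queries whose gold item appears in the first k results."""
    hits = sum(g in r[:k] for r, g in zip(ranked, gold))
    return hits / len(gold)

def binary_f1(predicted: list[int], actual: list[int]) -> float:
    """F1 for the positive class (1 = idiomatic, 0 = literal)."""
    tp = sum(p == 1 and a == 1 for p, a in zip(predicted, actual))
    fp = sum(p == 1 and a == 0 for p, a in zip(predicted, actual))
    fn = sum(p == 0 and a == 1 for p, a in zip(predicted, actual))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Two toy queries: the first is answered at rank 1, the second at rank 2.
ranked = [["spill the tea", "spill the beans"], ["bite the bullet", "break a leg"]]
gold = ["spill the tea", "break a leg"]

top1 = top_k_accuracy(ranked, gold, k=1)
top2 = top_k_accuracy(ranked, gold, k=2)
f1 = binary_f1([1, 1, 0, 0], [1, 0, 1, 0])
```

Top-1 is the strictest retrieval measure: a system that ranks the right idiom second gets no credit, which is part of why retrieval scores and classification accuracy should not be compared directly.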

A Task 4 output is designed to be readable

Input: Spill the tea
Canonical meaning: Reveal gossip or personal secrets
Multilingual output: English, Arabic and French explanations
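
A readable output of this kind could be represented as a small structured record. The sketch below is hypothetical throughout: an exact dictionary lookup stands in for the retrieval step, the key names are invented, and the Arabic and French entries are placeholders rather than real explanations.

```python
import json

# Hypothetical lookup table; placeholders stand in for real explanations.
MEANINGS = {
    "spill the tea": {
        "en": "Reveal gossip or personal secrets",
        "ar": "<Arabic explanation>",
        "fr": "<French explanation>",
    }
}

def interpret(text: str) -> dict:
    """Map an input to its canonical idiom and attach the stored explanations."""
    canonical = " ".join(text.lower().split())
    if canonical not in MEANINGS:
        raise KeyError(f"no entry for {canonical!r}")
    return {
        "input": text,
        "canonical_idiom": canonical,
        "explanations": MEANINGS[canonical],
    }

result = interpret("Spill the tea")
print(json.dumps(result, indent=2, ensure_ascii=False))
```

Keeping the canonical idiom and each explanation as separate fields lets a person check both which idiom was retrieved and what the system claims it means.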

06 Why figurative-language understanding matters

Idioms are not edge cases confined to dictionaries. They appear in conversation, support requests, social media, subtitles, teaching material and everyday instructions. Systems that interpret them literally can misunderstand intent even when every individual token is familiar.

Conversational AI: Chatbots and assistants that better recognise what users actually mean.
Translation: Systems that seek an equivalent meaning rather than translating word by word.
Language learning: Tools that retrieve explanations, contexts and cross-language equivalents.
Semantic search: Retrieval based on intended meaning rather than surface-form overlap.
Content analysis: Improved treatment of slang, figurative language and context-dependent usage.
Human–robot interaction: Interfaces that cope more reliably with natural, culturally situated speech.

07 The project is open: data, code, paper and demonstrations

IdiomX separates the dataset-construction pipeline from the modelling and benchmark repository. This makes the provenance of the resource easier to inspect while keeping the task notebooks, trained artefacts and demonstrations organised around evaluation.

Ayman Ali Sharara

DSTI student in the MSc in Data Science & AI, studying in the online asynchronous format. His work spans multilingual NLP, data engineering, retrieval systems and practical AI applications. IdiomX was developed as his Deep Learning with Python project.

Article adapted for the DSTI TechBlog from Ayman Sharara’s original student-project contribution and the project’s current public documentation. The writing and presentation have been revised while preserving the project’s methods, claims and reported results.