# Building IdiomX: language beyond the literal

Canonical HTML: https://dsti.school/techblog/idiomx-multilingual-idiom-understanding

This Markdown copy is generated from the same DSTI static-site build as the canonical HTML page. It is intended for machine readability and concise retrieval.


“Spill the tea” is not an instruction involving a drink. DSTI student Ayman Ali Sharara built IdiomX to test whether multilingual AI systems can detect idiomatic usage, retrieve figurative expressions and explain meaning across English, Arabic and French.

Ayman Ali Sharara · DSTI student · MSc in Data Science & AI · Online asynchronous

26 May 2026 · 13 min read · Deep Learning with Python project

Tags: idiomx, multilingual-nlp, figurative-language, semantic-retrieval, english-arabic-french, student-project

## One expression. Meaning beyond the words.

Observed expression: “Spill the tea”

Idiomatic interpretation:

- EN: Reveal gossip or personal secrets.
- AR: كشف الشائعات أو الأسرار
- FR: Révéler des potins.

- 190K+ contextual examples
- 12K+ idioms
- 3 aligned languages
- 4 benchmark tasks

Idioms expose a gap between recognising words and understanding language. Their meanings are often non-compositional: knowing every individual word is not enough to recover what the expression means in context. A system can parse “break a leg” perfectly and still mistake encouragement for injury.

**A student project built for public scrutiny.** IdiomX began within the DSTI course Deep Learning with Python, supervised by Pr Hanna Abi Akl. Ayman completed the work as a student in the [MSc in Data Science & AI](https://dsti.school/msc-in-data-science-and-ai), studying through the [Online asynchronous](https://dsti.school/online-studies) mode. The resulting dataset, construction pipeline, models, notebooks, paper and demonstrations are publicly accessible.

## 01 When language stops being literal

Humans routinely distinguish literal and figurative readings from context. “She spilled the tea on the table” describes an accident; “she spilled the tea about the meeting” describes disclosure. That distinction remains difficult for natural-language systems because lexical overlap can be almost identical while the meaning changes completely.
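To make the lexical-overlap point concrete, here is a minimal sketch that measures how many word tokens the two example sentences share. The helper names `tokens` and `jaccard` are illustrative and are not taken from the IdiomX code.

```python
import re

def tokens(sentence: str) -> set[str]:
    """Lowercase a sentence and return its set of word tokens."""
    return set(re.findall(r"[a-z']+", sentence.lower()))

def jaccard(a: str, b: str) -> float:
    """Jaccard overlap between the token sets of two sentences."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb)

literal = "She spilled the tea on the table"
figurative = "She spilled the tea about the meeting"

# Tokens that appear in both sentences, including the whole idiom.
shared = tokens(literal) & tokens(figurative)
overlap = jaccard(literal, figurative)

print(sorted(shared))
print(f"Jaccard overlap: {overlap:.2f}")
```

Half of the combined vocabulary is shared, and the idiom itself is entirely inside the shared part, so a purely lexical signal cannot separate the two readings. Only the remaining context words carry the distinction.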

The original project started from a practical limitation: many idiom resources are modest in scale, narrow in language coverage or designed around one isolated task. IdiomX instead treats idiom understanding as a sequence of related problems, from recognising figurative usage to retrieving and explaining meaning across languages.

The design question: can one reproducible resource support detection, semantic retrieval, cross-lingual alignment and interpretable meaning retrieval together, instead of each capability being evaluated in isolation?

![IdiomX project cover illustrating multilingual idiom understanding](https://media.dsti.school/wp-content/uploads/2026/05/25102835/IdiomX_Cover.avif)

> **Figure caption:** IdiomX frames figurative-language understanding as a multilingual data, modelling and evaluation problem.

## 02 Building a multilingual dataset at scale

The public release contains more than 190,000 contextual examples spanning over 12,000 idioms. English expressions are linked to Arabic and French semantic representations, together with idiomatic, literal and borderline usage labels and supporting linguistic metadata.

- ≈196K rows in the current full Hugging Face dataset
- 12K+ unique idioms represented
- EN · AR · FR: English, Arabic and French semantic alignment
- ≈1.04 reported sentence-reuse factor after cleaning

The construction process combines lexical resources, controlled generation and validation. Its modular structure matters as much as its size: the objective is to make every stage inspectable and repeatable rather than publishing an opaque final file.

### 1. Collection

Extract candidate idioms from sources including Wiktionary-derived data and WordNet, while extending coverage with modern and generated candidates.

### 2. Cleaning and normalisation

Filter noise, standardise expressions, remove duplicates and prepare consistent records for enrichment and evaluation.
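The cleaning step could look roughly like the sketch below. The specific rules (NFKC normalisation, lowercasing, punctuation stripping) are assumptions chosen for illustration; the article does not list the exact rules used in the IdiomX pipeline, and the function names are hypothetical.

```python
import re
import unicodedata

def normalise_idiom(text: str) -> str:
    """Return a canonical form: NFKC, lowercase, straight apostrophes, single spaces."""
    text = unicodedata.normalize("NFKC", text)
    text = text.replace("\u2019", "'").lower()
    text = re.sub(r"[^\w\s'-]", " ", text)  # replace stray punctuation with spaces
    return re.sub(r"\s+", " ", text).strip()

def deduplicate(candidates: list[str]) -> list[str]:
    """Keep the first occurrence of each normalised idiom, preserving order."""
    seen: set[str] = set()
    unique: list[str] = []
    for raw in candidates:
        key = normalise_idiom(raw)
        if key and key not in seen:  # empty keys are noise and are skipped
            seen.add(key)
            unique.append(key)
    return unique

raw_candidates = ["Spill the tea", "spill  the tea!", "Break a leg", "  break a leg ", "???"]
cleaned = deduplicate(raw_candidates)
print(cleaned)
```

Normalising before comparing is what lets surface variants of the same expression collapse into one record.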

### 3. Controlled LLM enrichment

Use GPT-4.1-mini to generate meanings, contextual examples and aligned English, Arabic and French semantic fields.
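One way to keep generation "controlled" is to accept a model response only when it parses as a complete record. The sketch below assumes the model is asked to return JSON; the field names in `REQUIRED_FIELDS` are hypothetical and do not reflect the published IdiomX schema, and the model call itself is deliberately left out.

```python
import json

# Hypothetical field names, chosen for illustration only.
REQUIRED_FIELDS = ("idiom", "meaning_en", "meaning_ar", "meaning_fr", "example")

def parse_enrichment(raw_response: str) -> dict:
    """Parse a model response and reject records with missing or empty fields."""
    record = json.loads(raw_response)
    if not isinstance(record, dict):
        raise ValueError("expected a JSON object")
    missing = [
        field for field in REQUIRED_FIELDS
        if not isinstance(record.get(field), str) or not record[field].strip()
    ]
    if missing:
        raise ValueError(f"missing or empty fields: {missing}")
    return {field: record[field].strip() for field in REQUIRED_FIELDS}

good_response = json.dumps({
    "idiom": "spill the tea",
    "meaning_en": "Reveal gossip or personal secrets.",
    "meaning_ar": "كشف الشائعات أو الأسرار",
    "meaning_fr": "Révéler des potins.",
    "example": "She spilled the tea about the meeting.",
})
record = parse_enrichment(good_response)

# An incomplete response is rejected instead of entering the dataset.
try:
    parse_enrichment('{"idiom": "spill the tea"}')
    rejected = False
except ValueError:
    rejected = True
```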

### 4. Structured validation

Combine semantic-similarity scoring, rule-based checks, deduplication and leakage-aware splits to support reliable benchmarking.
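A leakage-aware split can be sketched as follows. It assumes that leakage is avoided by keeping every example of an idiom in the same split; the article does not state the grouping key IdiomX actually uses, so both that choice and the names `split_for` and `grouped_split` are illustrative.

```python
import hashlib

def split_for(idiom: str, test_fraction: float = 0.2) -> str:
    """Assign every example of an idiom to the same split, using a stable hash."""
    digest = hashlib.sha256(idiom.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # value in [0, 1)
    return "test" if bucket < test_fraction else "train"

def grouped_split(rows: list[dict]) -> dict[str, list[dict]]:
    """Partition rows into train and test without splitting any idiom across both."""
    splits: dict[str, list[dict]] = {"train": [], "test": []}
    for row in rows:
        splits[split_for(row["idiom"])].append(row)
    return splits

rows = [
    {"idiom": "spill the tea", "sentence": "She spilled the tea about the meeting."},
    {"idiom": "spill the tea", "sentence": "She spilled the tea on the table."},
    {"idiom": "break a leg", "sentence": "Break a leg tonight!"},
    {"idiom": "piece of cake", "sentence": "The exam was a piece of cake."},
]
splits = grouped_split(rows)
train_idioms = {row["idiom"] for row in splits["train"]}
test_idioms = {row["idiom"] for row in splits["test"]}
print(train_idioms.isdisjoint(test_idioms))
```

Hashing the idiom rather than shuffling rows makes the assignment reproducible across runs and machines, which fits the stated goal of an inspectable and repeatable pipeline.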

![IdiomX data preparation pipeline](https://media.dsti.school/wp-content/uploads/2026/05/25135940/IdiomX_Data_perep_Pipeline_v2.avif)

> **Figure caption:** The data-preparation workflow moves from heterogeneous lexical sources to normalised, enriched and validated examples.

## 03 More than a dataset: one progressive benchmark

IdiomX is organised as a progression. The first task asks whether a model recognises figurative usage. The later tasks ask whether it can retrieve an appropriate idiom from context, align meaning across languages and return an explanation that a person can inspect.

![Full IdiomX dataset and benchmark pipeline](https://media.dsti.school/wp-content/uploads/2026/05/25140144/IdiomX_full_pipeline_V1.avif)

> **Figure caption:** The complete workflow joins dataset construction, model training, retrieval benchmarking, multilingual interpretation and deployment-ready artefacts.

## 04 Four tasks, from recognition to interpretation

### Task 1: Idiom detection

Determine whether an expression is being used idiomatically or literally within its sentence.

- Compared: TF-IDF with Logistic Regression, DistilBERT and RoBERTa
- Reported best: RoBERTa
- Capability: Contextual disambiguation

### Task 2: Context-to-idiom retrieval

Given a contextual sentence, rank the idioms that best express its underlying figurative meaning.

- Compared: Dense, lexical and hybrid retrieval with reranking
- Reported best: Hybrid retrieval with a fine-tuned reranker
- Capability: Semantic retrieval
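Hybrid retrieval needs some rule for merging a lexical ranking with a dense one. The sketch below uses reciprocal rank fusion, a common choice for this purpose; the article does not say which fusion method IdiomX uses, so treat it as an assumption. The two input rankings are made up for illustration, and the reranking stage is not shown.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked lists: each item scores the sum of 1 / (k + rank) over all lists."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] = scores.get(item, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken alphabetically for determinism.
    return sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))

# Illustrative rankings for one query, not real system output.
lexical_ranking = ["spill the beans", "spill the tea", "let the cat out of the bag"]
dense_ranking = ["spill the tea", "let the cat out of the bag", "spill the beans"]

fused = reciprocal_rank_fusion([lexical_ranking, dense_ranking])
for idiom, score in fused:
    print(f"{score:.4f}  {idiom}")
```

In a pipeline like the one described, the top fused candidates would then be passed to the reranker for a final ordering.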

### Task 3: Arabic-to-English retrieval

Use an Arabic context to retrieve the corresponding English idiom, testing semantic alignment across languages.

- Compared: Multilingual MiniLM, multilingual E5 and fine-tuned E5
- Reported best: Fine-tuned E5
- Capability: Cross-lingual alignment
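Cross-lingual retrieval of this kind relies on the Arabic context and the English idioms being embedded in one shared vector space, where ranking reduces to a similarity search. The sketch below shows only that ranking step. The three-dimensional vectors are hand-made stand-ins for real model embeddings, and `retrieve` is a hypothetical helper.

```python
import math

def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def retrieve(query_vec: list[float], index: dict[str, list[float]], top_k: int = 3):
    """Rank indexed idioms by cosine similarity to the query vector."""
    scored = [(idiom, cosine(query_vec, vec)) for idiom, vec in index.items()]
    return sorted(scored, key=lambda pair: -pair[1])[:top_k]

# Hand-made toy vectors, for illustration only.
idiom_index = {
    "spill the tea": [0.9, 0.1, 0.0],
    "break a leg": [0.0, 0.2, 0.9],
    "piece of cake": [0.1, 0.9, 0.1],
}
# Stands in for the embedding of an Arabic context about revealing secrets.
arabic_query_vec = [0.8, 0.2, 0.1]

ranked = retrieve(arabic_query_vec, idiom_index)
for idiom, score in ranked:
    print(f"{score:.3f}  {idiom}")
```

Because the query and the idioms share no surface vocabulary, all of the signal has to come from the embedding model, which is consistent with the article's observation that fine-tuning matters most on this task.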

### Task 4: Idiom interpretation

Retrieve the canonical idiom and explain its meaning in English, Arabic and French.

- Compared: Dense and hybrid retrieval, with and without reranking
- Reported best: Hybrid retrieval with reranking
- Capability: Explainable semantic grounding

## 05 What the benchmark experiments found

The project reports that contextual transformers substantially improve idiom detection, while hybrid lexical–dense retrieval outperforms dense retrieval alone. Fine-tuning is particularly important for the Arabic-to-English task, where surface forms provide little direct lexical help.

| Task | Reported leading configuration | Main result |
| --- | --- | --- |
| Detection | RoBERTa | 92.6% accuracy · F1 0.926 |
| Context → idiom | Hybrid retrieval + fine-tuned reranker | Top-1 88.5% |
| Arabic → English idiom | Fine-tuned E5 | Top-1 57.8% |
| Interpretation | Hybrid retrieval + reranker | Top-1 67.4% |

The figures are not interchangeable: each task tests a different search space and difficulty. Their combined value is the progression from classification toward multilingual retrieval and interpretable output.
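For the three retrieval tasks, Top-1 is the share of queries whose correct idiom is ranked first. A minimal sketch of the metric follows, using made-up predictions that are unrelated to the reported results; the function name is illustrative.

```python
def top_k_accuracy(ranked_predictions: list[list[str]], gold: list[str], k: int = 1) -> float:
    """Fraction of queries whose gold idiom appears in the first k ranked candidates."""
    if not gold:
        raise ValueError("gold labels must not be empty")
    hits = sum(
        1 for ranking, target in zip(ranked_predictions, gold)
        if target in ranking[:k]
    )
    return hits / len(gold)

# Illustrative rankings for four queries, not benchmark output.
predictions = [
    ["spill the tea", "spill the beans"],
    ["piece of cake", "break a leg"],
    ["bite the bullet", "break a leg"],
    ["spill the beans", "spill the tea"],
]
gold = ["spill the tea", "break a leg", "break a leg", "spill the tea"]

top1 = top_k_accuracy(predictions, gold, k=1)
top2 = top_k_accuracy(predictions, gold, k=2)
print(f"Top-1: {top1:.2f}  Top-2: {top2:.2f}")
```

The gap between the two values in this toy case shows why Top-1 is a strict measure: a near-miss such as "spill the beans" for "spill the tea" counts as a full error.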

### A Task 4 output is designed to be readable

- Input: Spill the tea
- Canonical meaning: Reveal gossip or personal secrets
- Multilingual output: English, Arabic and French explanations
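A readable Task 4 result can be represented as a small structured record. The sketch below reuses the meanings from the article's opening example; the `Interpretation` class, the `MEANINGS` table and the `interpret` function are hypothetical and are not the project's actual API.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Interpretation:
    """One retrieved idiom with its explanation in three languages."""
    idiom: str
    meaning_en: str
    meaning_ar: str
    meaning_fr: str

    def render(self) -> str:
        return "\n".join([
            f"Idiom: {self.idiom}",
            f"EN: {self.meaning_en}",
            f"AR: {self.meaning_ar}",
            f"FR: {self.meaning_fr}",
        ])

# Hypothetical lookup table holding a single entry from the article.
MEANINGS = {
    "spill the tea": Interpretation(
        idiom="spill the tea",
        meaning_en="Reveal gossip or personal secrets.",
        meaning_ar="كشف الشائعات أو الأسرار",
        meaning_fr="Révéler des potins.",
    ),
}

def interpret(retrieved_idiom: str) -> Optional[Interpretation]:
    """Look up the explanation for an idiom returned by the retrieval stage."""
    return MEANINGS.get(retrieved_idiom.strip().lower())

result = interpret("Spill the tea")
if result is not None:
    print(result.render())
```

Returning `None` for an unknown idiom keeps a retrieval miss visible to the caller, so no explanation is invented for it.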

## 06 Why figurative-language understanding matters

Idioms are not edge cases confined to dictionaries. They appear in conversation, support requests, social media, subtitles, teaching material and everyday instructions. Systems that interpret them literally can misunderstand intent even when every individual token is familiar.

- **Conversational AI:** Chatbots and assistants that better recognise what users actually mean.
- **Translation:** Systems that seek an equivalent meaning rather than translating word by word.
- **Language learning:** Tools that retrieve explanations, contexts and cross-language equivalents.
- **Semantic search:** Retrieval based on intended meaning rather than surface-form overlap.
- **Content analysis:** Improved treatment of slang, figurative language and context-dependent usage.
- **Human–robot interaction:** Interfaces that cope more reliably with natural, culturally situated speech.

### What remains difficult

- Some examples are LLM-generated, so controlled enrichment does not remove the need for critical review.
- Idiomatic interpretation can legitimately vary by context, culture and register.
- Open-ended inputs may retrieve a related idiom rather than the exact intended expression.
- Cross-lingual retrieval remains materially harder than monolingual detection.

## 07 The project is open: data, code, paper and demonstrations

IdiomX separates the dataset-construction pipeline from the modelling and benchmark repository. This makes the provenance of the resource easier to inspect while keeping the task notebooks, trained artefacts and demonstrations organised around evaluation.

### Ayman Ali Sharara

DSTI student in the MSc in Data Science & AI, studying through Online asynchronous. His work spans multilingual NLP, data engineering, retrieval systems and practical AI applications. IdiomX was developed as his Deep Learning with Python project.

[MSc in Data Science & AI](https://dsti.school/msc-in-data-science-and-ai) · [Online asynchronous](https://dsti.school/online-studies)

[LinkedIn](https://www.linkedin.com/in/ayman-sharara/) · [GitHub](https://github.com/aymanshar) · [Hugging Face](https://huggingface.co/aymansharara)

Article adapted for the DSTI TechBlog from Ayman Sharara’s original student-project contribution and the project’s current public documentation. The writing and presentation have been revised while preserving the project’s methods, claims and reported results.
