ArabicNLPWorld – Arabic MSA, Dialects & Low‑Resource NLP Research Hub

ArabicNLPWorld is a research organization dedicated to natural language processing for Modern Standard Arabic (MSA) — a well‑resourced language — as well as under‑resourced Arabic dialects, low‑resource language pairs involving Arabic, Islamic religious texts, and Arabic–Russian translation. We develop and share open‑source models, datasets, and educational tools to bridge the digital divide across all varieties and modalities of Arabic.
📌 This is an organization card. Our models, datasets, and demos are available on our Hugging Face Organization Page.
🎯 Our Mission
- Build state‑of‑the‑art language models for Modern Standard Arabic (MSA) — leveraging its rich existing resources.
- Create resources and models for under‑resourced Arabic dialects (Egyptian, Levantine, Gulf, Maghrebi, Sudanese, etc.).
- Advance Arabic–Russian machine translation using our corpus of 15.8M parallel sentence pairs.
- Support low‑resource language pairs where Arabic is one side (e.g., Arabic ↔ Tatar, Arabic ↔ Chechen, Arabic ↔ Bashkir, Arabic ↔ Hausa, Arabic ↔ Somali).
- Develop specialised NLP tools for Islamic religious texts:
  - The Quran with Russian translation (Elmir Kuliev)
  - Sahih al-Bukhari, regarded in Sunni tradition as the most authentic hadith collection
  - Sahih Muslim, regarded as the second most authentic collection
  - 40 Hadith of al-Nawawi (42 hadith in the standard text, despite the title)
  - Kutub al-Sittah (The Six Major Hadith Collections), including Sunan Abu Dawud, Jami' at-Tirmidhi, Sunan an-Nasa'i, and Sunan Ibn Majah
- Foster a community of researchers, developers, native speakers, dialect speakers, and Islamic scholars working together on inclusive Arabic NLP.
🧠 Clarification: MSA vs. Dialects vs. Low‑Resource
| Variety / Pair | Resource Status | Description |
| --- | --- | --- |
| Modern Standard Arabic (MSA) | ✅ Well‑resourced | Hundreds of billions of tokens, many pretrained models (AraBERT, MARBERT, AraT5, CAMeLBERT), large parallel corpora with English and other major languages. |
| Arabic dialects (Egyptian, Levantine, Gulf, Maghrebi, etc.) | ⚠️ Under‑resourced to low‑resource | Limited annotated data, few pretrained models, scarce parallel corpora with MSA or English. Egyptian is best‑resourced among dialects but still far behind MSA. |
| Arabic ↔ Russian translation | 🔄 Mid‑resource | Our 15.8M corpus is the largest publicly available for this pair, but still modest compared to English‑Arabic (100M+). |
| Low‑resource pairs (Arabic ↔ Turkic, Caucasian, African languages) | ❌ Low‑resource | Very few (often zero) parallel datasets; requires transfer learning, data augmentation, and zero‑shot techniques. |
| Islamic religious texts | 📖 Domain‑specific | Rich but specialised vocabulary (classical Arabic). Includes Quran, Sahih al-Bukhari, Sahih Muslim, 40 Hadith of al-Nawawi, and Kutub al-Sittah with curated parallel translations. |
🚀 Interactive Demos
Explore our live Hugging Face Spaces and try out our models directly in your browser:
🔤 Language Models
🌐 Machine Translation
📚 Linguistic Tools
📊 Data & Benchmarks
Click on any demo to start experimenting – no installation required!
🧠 Research Focus Areas
🇸🇦 Modern Standard Arabic (MSA) – Well‑Resourced
- Continued pretraining and fine‑tuning of MSA models (AraBERT, AraT5, MARBERT)
- Benchmarking on standard tasks (POS, NER, sentiment, QA)
- Leveraging MSA as a source for transfer learning to dialects
🗣️ Arabic Dialects – Under‑Resourced to Low‑Resource
Focus on (ISO 639-3 codes in parentheses): Egyptian (arz), Levantine (apc), Gulf (afb), Maghrebi (e.g., Moroccan, ary), Sudanese (apd)
Challenges we address:
- Lack of annotated data → data augmentation, semi‑supervised learning
- Few parallel corpora (dialect ↔ MSA, dialect ↔ English)
- Absence of dialect‑specific pretrained models
Our approach:
- Cross‑lingual transfer from MSA to dialects
- Few‑shot and zero‑shot learning for dialect tasks
- Crowdsourced annotation and validation with native speakers
🔄 Arabic–Russian Bilingual NLP – Mid‑Resource
- 15,801,992 parallel sentences (our flagship corpus)
- Sources: OPUS, TED, Baranov dictionary, Borisov dictionary, Sahih al-Bukhari, Sahih Muslim, 40 Hadith, Quran (Kuliev), phrasebook, Tatoeba
- Length correlation (Arabic vs. Russian sentence length): 0.925
- Applications: translation, cross‑lingual retrieval, bilingual lexicography
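This card does not say how the length correlation was computed. One plausible reading is the Pearson correlation between the length of each Arabic sentence and the length of its Russian counterpart. The sketch below follows that assumption, measures length in whitespace-separated tokens (also an assumption), and uses toy sentence pairs rather than corpus data. The function name `length_correlation` is illustrative only.

```python
import math

def length_correlation(pairs):
    """Pearson correlation between source and target sentence lengths.

    `pairs` is an iterable of (arabic_sentence, russian_sentence) tuples.
    Length is counted in whitespace-separated tokens, which is an assumption;
    the published figure may have been computed on characters or subwords.
    """
    xs, ys = [], []
    for ar, ru in pairs:
        xs.append(len(ar.split()))
        ys.append(len(ru.split()))
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two sentence pairs")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when all lengths are equal")
    return cov / math.sqrt(var_x * var_y)

# Toy pairs for illustration only, not taken from the corpus.
toy_pairs = [
    ("السلام عليكم", "Мир вам"),
    ("أين المكتبة؟", "Где находится библиотека?"),
    ("أنا أتعلم اللغة العربية كل يوم", "Я изучаю арабский язык каждый день"),
]
r = length_correlation(toy_pairs)
print(f"length correlation on toy data: {r:.3f}")
```

A value close to 1 suggests that long source sentences tend to be paired with long target sentences, which is a common sanity check for alignment quality.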
🌍 Low‑Resource Pairs Involving Arabic – Low‑Resource
We focus on language pairs with minimal or no parallel data:
| Pair | Resource Status | Our Work |
| --- | --- | --- |
| Arabic ↔ Tatar | Very low | Data collection, transfer learning from Arabic–Russian + Russian–Tatar |
| Arabic ↔ Chechen | Extremely low | Zero‑shot translation via English or Russian pivot |
| Arabic ↔ Bashkir | Extremely low | Cross‑lingual embeddings |
| Arabic ↔ Hausa | Very low | Leveraging the NLLB model |
| Arabic ↔ Somali | Very low | Data collection and annotation |
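Translating through a pivot language, as mentioned for pairs such as Arabic ↔ Chechen, amounts to chaining two translation steps. The sketch below shows only that composition. The `Translator` alias, `pivot_translate`, `make_lookup_translator`, and the one-word lexicons are hypothetical placeholders, not this organization's models; in practice each step would be a trained MT system.

```python
from typing import Callable, Dict

# A translator is modelled as any function from source text to target text.
Translator = Callable[[str], str]

def pivot_translate(text: str, src_to_pivot: Translator, pivot_to_tgt: Translator) -> str:
    """Translate via a pivot language by chaining two translators."""
    pivot_text = src_to_pivot(text)
    return pivot_to_tgt(pivot_text)

def make_lookup_translator(table: Dict[str, str]) -> Translator:
    """Toy word-by-word translator; unknown words pass through unchanged."""
    def translate(text: str) -> str:
        return " ".join(table.get(word, word) for word in text.split())
    return translate

# Hypothetical one-word lexicons, for illustration only.
ar_to_ru = make_lookup_translator({"كتاب": "книга"})   # Arabic -> Russian
ru_to_tt = make_lookup_translator({"книга": "китап"})  # Russian -> Tatar

result = pivot_translate("كتاب", ar_to_ru, ru_to_tt)
print(result)
```

Because errors from the first step are passed on to the second, pivoting is usually best treated as a baseline to compare against transfer learning or direct zero-shot models.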
🕌 Islamic Religious Texts – Domain‑Specific
We provide digitised, aligned, and machine‑readable versions of major Islamic texts:
| Text | Description | Parallel Translation |
| --- | --- | --- |
| The Quran | The holy book of Islam, 114 surahs | Russian (Elmir Kuliev), English (Sahih International) |
| Sahih al-Bukhari | Regarded as the most authentic hadith collection (over 7,000 hadith, counting repetitions) | Russian translation |
| Sahih Muslim | Regarded as the second most authentic collection (over 7,000 hadith, counting repetitions) | Russian translation |
| 40 Hadith of al-Nawawi | Concise collection of essential hadith (42 in the standard text, despite the title) | Russian translation |
| Sunan Abu Dawud | One of the six major collections (Kutub al-Sittah) | Russian (in progress) |
| Jami' at-Tirmidhi | One of the six major collections | Russian (in progress) |
| Sunan an-Nasa'i | One of the six major collections | Russian (in progress) |
| Sunan Ibn Majah | One of the six major collections | Russian (in progress) |
Applications:
- Semantic search over hadith corpora
- Question answering on Islamic texts
- Classical Arabic morphological analysis
- Cross‑collection hadith matching (e.g., finding the same hadith in Bukhari and Muslim)
- Alignment of multiple translations for linguistic study
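Cross‑collection hadith matching can be approximated, as a first pass, by normalising the Arabic text and comparing character n‑grams. The sketch below is a simple lexical baseline under stated assumptions, not the method used in our tools: it strips diacritics and tatweel, unifies alef variants, and scores pairs with Jaccard similarity over character trigrams. All function names and the 0.6 threshold are illustrative.

```python
import re

# Arabic tatweel (U+0640), short vowels and related marks (U+064B-U+0652),
# and superscript alef (U+0670).
_MARKS = re.compile("[\u0640\u064B-\u0652\u0670]")
# Alef with madda, hamza above, and hamza below.
_ALEF_VARIANTS = re.compile("[\u0622\u0623\u0625]")

def normalize(text: str) -> str:
    """Strip diacritics and tatweel, unify alef forms, collapse whitespace."""
    text = _MARKS.sub("", text)
    text = _ALEF_VARIANTS.sub("\u0627", text)
    return " ".join(text.split())

def char_ngrams(text: str, n: int = 3) -> set:
    """Set of character n-grams; empty if the text is shorter than n."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def similarity(a: str, b: str, n: int = 3) -> float:
    """Jaccard similarity over character n-grams of the normalised texts."""
    grams_a = char_ngrams(normalize(a), n)
    grams_b = char_ngrams(normalize(b), n)
    if not grams_a or not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)

def best_matches(collection_a, collection_b, threshold=0.6):
    """For each text in collection_a, keep its best-scoring text in
    collection_b as (index_a, index_b, score) if the score meets the threshold."""
    matches = []
    for i, text_a in enumerate(collection_a):
        scored = [(similarity(text_a, text_b), j) for j, text_b in enumerate(collection_b)]
        if not scored:
            continue
        score, j = max(scored, key=lambda item: item[0])
        if score >= threshold:
            matches.append((i, j, score))
    return matches

# Toy data: the same well-known opening phrase with and without diacritics,
# plus an unrelated phrase. Not drawn from our aligned datasets.
collection_a = ["إِنَّمَا الأَعْمَالُ بِالنِّيَّاتِ"]
collection_b = ["الدين النصيحة", "إنما الأعمال بالنيات"]
matches = best_matches(collection_a, collection_b)
print(matches)
```

A lexical baseline like this misses paraphrased or abridged narrations, so semantic embeddings would likely be needed as a second stage.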
📖 Lexicographic Resources
- Arabic‑Russian Dictionary – Kh.K. Baranov (latest edition) – digitised and aligned
- Russian‑Arabic Dictionary – V.M. Borisov (latest edition) – bidirectional coverage
- Machine‑readable formats for NLP integration
📚 Educational Resources
We believe in open education and reproducible research. All our tutorials and teaching materials are freely available.
- Interactive Notebooks – Arabic NLP, dialect processing, Arabic–Russian MT, low‑resource techniques (in Python, using Hugging Face libraries)
- Video Lectures – Recorded talks on Arabic morphology, dialect identification, and Islamic text processing
- Course Materials – Slides, readings, and assignments from our university courses
- Blog Posts – Deep dives into challenges and solutions for Arabic dialects and low‑resource pairs
🤝 Get Involved
We welcome contributions from the community – researchers, developers, students, native speakers, dialect speakers, and Islamic scholars.
For Researchers
- Use our models and datasets (and cite us!)
- Collaborate on dialect annotation or low‑resource pair projects
- Contribute new benchmarks for dialects or Arabic–Russian MT
For Developers
- Integrate our models into translation, search, or chatbot applications
- Report bugs or suggest improvements via GitHub Issues
- Submit pull requests to our open‑source repositories
For Native & Dialect Speakers
- Help us validate dialect annotations and translations
- Share dialect texts (with permission) to enrich our data
- Provide feedback on model outputs to reduce errors
For Islamic Scholars & Students
- Help verify Quranic verse alignments and hadith translations
- Suggest improvements for religious text processing
- Use our tools for digital Islamic studies
For Students
- Use our demos and tutorials for learning
- Participate in our mentorship program or summer schools
- Start your own research project with our support
📊 Corpus Highlights
Our flagship resource – the Arabic–Russian Translation Corpus:
| Statistic | Value |
| --- | --- |
| Total pairs | 15,801,992 |
| Length correlation | 0.925 |
| Arabic tokens | 357.7M |
| Russian tokens | 366.0M |
| Unique Arabic tokens | 1,848,317 |
| Unique Russian tokens | 933,467 |
| Sources | OPUS, TED, Baranov, Borisov, Sahih al-Bukhari, Sahih Muslim, 40 Hadith, Quran (Kuliev), phrasebook, Tatoeba |
Most frequent Arabic words: في (13.68M), من (8.45M), على (5.59M)
Most frequent Russian words: и (15.88M), в (15.52M), по (5.38M)
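Statistics of this kind (pair count, token totals, vocabulary size, most frequent words) can be reproduced on any parallel corpus with a single pass and two counters. The sketch below assumes whitespace tokenisation and lowercasing of the Russian side; the tokenisation behind the figures above is not documented here, so real counts may differ. The function name `corpus_stats` and the toy pairs are illustrative.

```python
from collections import Counter

def corpus_stats(pairs, top_k=3):
    """Summarise a parallel corpus given as (arabic, russian) sentence pairs.

    Tokens are whitespace-separated; Russian is lowercased before counting.
    Both choices are assumptions made for this sketch.
    """
    ar_counts, ru_counts = Counter(), Counter()
    n_pairs = 0
    for ar, ru in pairs:
        n_pairs += 1
        ar_counts.update(ar.split())
        ru_counts.update(ru.lower().split())
    return {
        "total_pairs": n_pairs,
        "arabic_tokens": sum(ar_counts.values()),
        "russian_tokens": sum(ru_counts.values()),
        "unique_arabic_tokens": len(ar_counts),
        "unique_russian_tokens": len(ru_counts),
        "top_arabic": ar_counts.most_common(top_k),
        "top_russian": ru_counts.most_common(top_k),
    }

# Toy pairs for illustration only, not taken from the corpus.
toy_pairs = [
    ("الكتاب في البيت", "Книга в доме"),
    ("القلم في الحقيبة", "Ручка в сумке"),
]
stats = corpus_stats(toy_pairs)
for name, value in stats.items():
    print(name, value)
```

Because `pairs` can be any iterable, the same function works on a generator that streams the corpus from disk instead of loading it into memory.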
🌐 Connect With Us
🔄 Ecosystem Integration
Our work is integrated with the broader Hugging Face ecosystem:
- Models on the Hub with easy‑to‑use pipelines
- Datasets with streaming and evaluation scripts
- Spaces for interactive demos and educational tools
- Gradio apps for user‑friendly interfaces
Empowering Arabic MSA, dialects, low‑resource pairs, and Islamic texts through open science and community collaboration.

© 2026 ArabicNLPWorld – Open science for Arabic, dialects, low‑resource pairs, and beyond.