ArabicNLPWorld – Arabic MSA, Dialects & Low‑Resource NLP Research Hub


ArabicNLPWorld is a research organization dedicated to natural language processing for Arabic in all its varieties: Modern Standard Arabic (MSA), which is well‑resourced; under‑resourced Arabic dialects; low‑resource language pairs involving Arabic; Islamic religious texts; and Arabic–Russian translation. We develop and share open‑source models, datasets, and educational tools to bridge the digital divide across all varieties and modalities of Arabic.

📌 This is an organization card. Our models, datasets, and demos are available on our Hugging Face Organization Page.


🎯 Our Mission


🧠 Clarification: MSA vs. Dialects vs. Low‑Resource

| Variety / Pair | Resource Status | Description |
|---|---|---|
| Modern Standard Arabic (MSA) | Well‑resourced | Hundreds of billions of tokens, many pretrained models (AraBERT, MARBERT, AraT5, CAMeLBERT), and large parallel corpora with English and other major languages. |
| Arabic dialects (Egyptian, Levantine, Gulf, Maghrebi, etc.) | ⚠️ Under‑resourced to low‑resource | Limited annotated data, few pretrained models, and scarce parallel corpora with MSA or English. Egyptian is the best‑resourced dialect but still far behind MSA. |
| Arabic ↔ Russian translation | 🔄 Mid‑resource | Our 15.8M‑pair corpus is the largest publicly available for this pair, but still modest compared to English–Arabic (100M+). |
| Low‑resource pairs (Arabic ↔ Turkic, Caucasian, African languages) | Low‑resource | Very few (often zero) parallel datasets; requires transfer learning, data augmentation, and zero‑shot techniques. |
| Islamic religious texts | 📖 Domain‑specific | Rich but specialised vocabulary (classical Arabic). Includes the Quran, Sahih al-Bukhari, Sahih Muslim, the 40 Hadith of al-Nawawi, and the remaining Kutub al-Sittah collections, with curated parallel translations. |

🚀 Interactive Demos

Explore our live Hugging Face Spaces and try out our models directly in your browser:

🔤 Language Models

🌐 Machine Translation

📚 Linguistic Tools

📊 Data & Benchmarks

Click on any demo to start experimenting – no installation required!


🧠 Research Focus Areas

🇸🇦 Modern Standard Arabic (MSA) – Well‑Resourced

🗣️ Arabic Dialects – Under‑Resourced to Low‑Resource

Focus on: Egyptian (arz), Levantine (apc), Gulf (afb), Maghrebi (ary, Moroccan), Sudanese (apd)

Challenges we address:

Our approach:

🔄 Arabic–Russian Bilingual NLP – Mid‑Resource

🌍 Low‑Resource Pairs Involving Arabic – Low‑Resource

We focus on language pairs with minimal or no parallel data:

| Pair | Resource Status | Our Work |
|---|---|---|
| Arabic ↔ Tatar | Very low | Data collection; transfer learning from Arabic–Russian and Russian–Tatar |
| Arabic ↔ Chechen | Extremely low | Zero‑shot translation via an English or Russian pivot |
| Arabic ↔ Bashkir | Extremely low | Cross‑lingual embeddings |
| Arabic ↔ Hausa | Very low | Leveraging the NLLB model |
| Arabic ↔ Somali | Very low | Data collection and annotation |
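The pivot strategy listed in the table can be sketched as a simple composition of two translation steps. This is a minimal illustration only, not our production pipeline: `make_pivot_translator` and `toy_lookup` are hypothetical names, and the word‑for‑word lookup tables stand in for real Arabic→Russian and Russian→target models. Tatar is used as the target here purely because the toy vocabulary is easy to verify; the chaining pattern is the same for any target language.

```python
from typing import Callable, Dict

# A translator is modelled as any function from source text to target text.
Translator = Callable[[str], str]


def make_pivot_translator(src_to_pivot: Translator, pivot_to_tgt: Translator) -> Translator:
    """Chain two translators so the pivot-language output of the first feeds the second."""
    def translate(text: str) -> str:
        return pivot_to_tgt(src_to_pivot(text))
    return translate


def toy_lookup(table: Dict[str, str]) -> Translator:
    """Hypothetical stand-in for a real MT model: word-for-word dictionary lookup.

    Tokens missing from the table are passed through unchanged.
    """
    def translate(text: str) -> str:
        return " ".join(table.get(token, token) for token in text.split())
    return translate


# Toy lexicons (illustrative only; real systems would use trained models).
ar_to_ru = toy_lookup({"كتاب": "книга"})   # Arabic -> Russian (pivot)
ru_to_tt = toy_lookup({"книга": "китап"})  # Russian (pivot) -> Tatar

ar_to_tt = make_pivot_translator(ar_to_ru, ru_to_tt)
result = ar_to_tt("كتاب")
```

Because errors made in the first step propagate into the second, pivoting tends to work best when both legs have reasonably strong models, which is one reason Russian is a natural pivot for Turkic and Caucasian targets.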

🕌 Islamic Religious Texts – Domain‑Specific

We provide digitised, aligned, and machine‑readable versions of major Islamic texts:

| Text | Description | Parallel Translation |
|---|---|---|
| The Quran | The holy book of Islam, 114 surahs | Russian (Elmir Kuliev), English (Sahih International) |
| Sahih al-Bukhari | Most authentic hadith collection (over 7,000 hadith) | Russian translation |
| Sahih Muslim | Second most authentic collection (over 7,000 hadith) | Russian translation |
| 40 Hadith of al-Nawawi | Concise collection of essential hadith (42 in total, despite the title) | Russian translation |
| Sunan Abu Dawud | One of the six major collections (Kutub al-Sittah) | Russian (in progress) |
| Jami' at-Tirmidhi | One of the six major collections | Russian (in progress) |
| Sunan an-Nasa'i | One of the six major collections | Russian (in progress) |
| Sunan Ibn Majah | One of the six major collections | Russian (in progress) |

Applications:

📖 Lexicographic Resources


📚 Educational Resources

We believe in open education and reproducible research. All our tutorials and teaching materials are freely available.


🤝 Get Involved

We welcome contributions from the community – researchers, developers, students, native speakers, dialect speakers, and Islamic scholars.

For Researchers

For Developers

For Native & Dialect Speakers

For Islamic Scholars & Students

For Students


📊 Corpus Highlights

Our flagship resource – the Arabic–Russian Translation Corpus:

| Statistic | Value |
|---|---|
| Total pairs | 15,801,992 |
| Length correlation | 0.925 |
| Arabic tokens | 357.7M |
| Russian tokens | 366.0M |
| Unique Arabic tokens | 1,848,317 |
| Unique Russian tokens | 933,467 |
| Sources | OPUS, TED, Baranov, Borisov, Sahih al-Bukhari, Sahih Muslim, 40 Hadith, Quran (Kuliev), phrasebook, Tatoeba |

Most frequent Arabic words: في (13.68M), من (8.45M), على (5.59M)

Most frequent Russian words: и (15.88M), в (15.52M), по (5.38M)
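Statistics of this kind can be reproduced on any parallel corpus with a few lines of Python. The sketch below rests on several assumptions that may differ from how the figures above were actually computed: tokens are split on whitespace, "length correlation" is taken to be the Pearson correlation between source and target sentence lengths in tokens, and `corpus_stats` and `sample_pairs` are hypothetical names introduced for illustration.

```python
from collections import Counter
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def corpus_stats(pairs, top_k=3):
    """Summarise an iterable of (source, target) sentence pairs.

    Assumption: whitespace tokenisation, which is only a rough proxy for
    Arabic and Russian word counts.
    """
    src_counts, tgt_counts = Counter(), Counter()
    src_lens, tgt_lens = [], []
    for src, tgt in pairs:
        src_tokens, tgt_tokens = src.split(), tgt.split()
        src_counts.update(src_tokens)
        tgt_counts.update(tgt_tokens)
        src_lens.append(len(src_tokens))
        tgt_lens.append(len(tgt_tokens))
    return {
        "total_pairs": len(src_lens),
        "src_tokens": sum(src_lens),
        "tgt_tokens": sum(tgt_lens),
        "unique_src_tokens": len(src_counts),
        "unique_tgt_tokens": len(tgt_counts),
        "length_correlation": pearson(src_lens, tgt_lens),
        "top_src": src_counts.most_common(top_k),
        "top_tgt": tgt_counts.most_common(top_k),
    }


# Tiny illustrative sample (not taken from the corpus).
sample_pairs = [
    ("في البيت", "в доме"),
    ("من في البيت", "кто в доме"),
    ("السلام عليكم", "мир вам"),
    ("أنا من مصر", "я из Египта"),
]
stats = corpus_stats(sample_pairs)
```

For a corpus of millions of pairs, the same function can be fed a generator that streams lines from disk, since it keeps only the counters and the two length lists in memory.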


🌐 Connect With Us


🔄 Ecosystem Integration

Our work is integrated with the broader Hugging Face ecosystem:


Empowering Arabic MSA, dialects, low‑resource pairs, and Islamic texts through open science and community collaboration.


© 2026 ArabicNLPWorld – Open science for Arabic, dialects, low‑resource pairs, and beyond.