ArabicNLPWorld – Arabic MSA, Dialects & Low‑Resource NLP Research Hub

ArabicNLPWorld is a research organization dedicated to natural language processing for Arabic in all its varieties: well‑resourced Modern Standard Arabic (MSA), under‑resourced Arabic dialects, low‑resource language pairs involving Arabic, Arabic–Russian translation, and Islamic religious texts. We develop and share open‑source models, datasets, and educational tools to bridge the digital divide across these varieties.

📌 This is an organization card. Our models, datasets, and demos are available on our Hugging Face Organization Page.

🎯 Our Mission

Build state‑of‑the‑art language models for Modern Standard Arabic (MSA) — leveraging its rich existing resources.
Create resources and models for under‑resourced Arabic dialects (Egyptian, Levantine, Gulf, Maghrebi, Sudanese, etc.).
Advance Arabic–Russian machine translation using our parallel corpus of 15.8M sentence pairs.
Support low‑resource language pairs where Arabic is one side (e.g., Arabic ↔ Tatar, Arabic ↔ Chechen, Arabic ↔ Bashkir, Arabic ↔ Hausa, Arabic ↔ Somali).
Develop specialised NLP tools for Islamic religious texts:
- The Quran with Russian translation (Elmir Kuliev)
- Sahih al-Bukhari – widely regarded in Sunni tradition as the most authentic hadith collection
- Sahih Muslim – widely regarded as the second most authentic collection
- 40 Hadith of al-Nawawi (the standard text contains 42 hadith despite its title)
- Kutub al-Sittah (The Six Major Hadith Collections) – the two Sahih collections above plus Sunan Abu Dawud, Jami' at-Tirmidhi, Sunan an-Nasa'i, and Sunan Ibn Majah
Foster a community of researchers, developers, native speakers, dialect speakers, and Islamic scholars working together on inclusive Arabic NLP.

🧠 Clarification: MSA vs. Dialects vs. Low‑Resource

Variety / Pair	Resource Status	Description
Modern Standard Arabic (MSA)	✅ Well‑resourced	Hundreds of billions of tokens of text, many pretrained models (AraBERT, AraT5, CAMeLBERT), large parallel corpora with English and other major languages.
Arabic dialects (Egyptian, Levantine, Gulf, Maghrebi, etc.)	⚠️ Under‑resourced to low‑resource	Limited annotated data, few pretrained models (e.g., MARBERT, trained largely on dialectal tweets), scarce parallel corpora with MSA or English. Egyptian is the best‑resourced dialect but still far behind MSA.
Arabic ↔ Russian translation	🔄 Mid‑resource	Our corpus of 15.8M sentence pairs is, to our knowledge, the largest publicly available for this pair, but still modest compared to English–Arabic resources (100M+).
Low‑resource pairs (Arabic ↔ Turkic, Caucasian, African languages)	❌ Low‑resource	Very few (often zero) parallel datasets; requires transfer learning, data augmentation, and zero‑shot techniques.
Islamic religious texts	📖 Domain‑specific	Rich but specialised vocabulary (classical Arabic). Includes Quran, Sahih al-Bukhari, Sahih Muslim, 40 Hadith of al-Nawawi, and Kutub al-Sittah with curated parallel translations.

🚀 Interactive Demos

Explore our live Hugging Face Spaces and try out our models directly in your browser:

🔤 Language Models

AraBERT Playground – Predict masked tokens and analyze MSA text.
DialectBERT Explorer – Pretrained model for Egyptian, Levantine, and Gulf Arabic.
Arabic–Russian Embeddings – Cross‑lingual word vectors for translation.

🌐 Machine Translation

Arabic ↔ Russian Translator – Neural translation demo (15.8M parallel pairs).
MSA ↔ Dialect Translator – Convert between Modern Standard Arabic and Egyptian/Levantine.
Quran & Hadith Translation Explorer – Arabic originals with Russian (Kuliev) and English parallels.

📚 Linguistic Tools

Arabic Morphological Analyzer – Root‑based segmentation and POS tagging.
Dialect Identifier – Detect MSA vs. Egyptian, Levantine, Gulf, Maghrebi.
Named Entity Recognition for Arabic – Identify persons, locations, organizations.

📊 Data & Benchmarks

Arabic–Russian Corpus Explorer – Browse 15.8M parallel sentences.
Dialect NLP Leaderboard – Compare model performance on dialect tasks.
Islamic Text Annotation Tool – Help us improve Quran/hadith alignments.

Click on any demo to start experimenting – no installation required!

🧠 Research Focus Areas

🇸🇦 Modern Standard Arabic (MSA) – Well‑Resourced

Continued pretraining and fine‑tuning of existing Arabic models (AraBERT, AraT5, and MARBERT, which also covers dialectal Arabic)
Benchmarking on standard tasks (POS, NER, sentiment, QA)
Leveraging MSA as a source for transfer learning to dialects

🗣️ Arabic Dialects – Under‑Resourced to Low‑Resource

Focus on: Egyptian (arz), Levantine (apc), Gulf (afb), Maghrebi (e.g., Moroccan, ary), Sudanese (apd). Codes are ISO 639‑3.

Challenges we address:

Lack of annotated data → data augmentation, semi‑supervised learning
Few parallel corpora (dialect ↔ MSA, dialect ↔ English)
Scarcity of dialect‑specific pretrained models

Our approach:

Cross‑lingual transfer from MSA to dialects
Few‑shot and zero‑shot learning for dialect tasks
Crowdsourced annotation and validation with native speakers
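
One common way to turn crowdsourced dialect labels into usable training data is majority voting with an agreement threshold. The sketch below is illustrative only: the function name, the 0.6 threshold, and the sample votes are assumptions, not a description of our actual annotation pipeline.

```python
from collections import Counter

def aggregate_labels(votes, min_agreement=0.6):
    """Majority-vote one item; return (label, agreement).

    The label is None when the winning label's share of the votes falls
    below min_agreement, so the item can be routed back to annotators.
    """
    if not votes:
        return None, 0.0
    label, top = Counter(votes).most_common(1)[0]
    agreement = top / len(votes)
    if agreement < min_agreement:
        return None, agreement
    return label, agreement

# Hypothetical votes from three native-speaker annotators per sentence,
# using the dialect codes listed above.
annotations = {
    "sent-1": ["arz", "arz", "arz"],
    "sent-2": ["apc", "arz", "apc"],
    "sent-3": ["afb", "apc", "arz"],
}

resolved = {sid: aggregate_labels(v) for sid, v in annotations.items()}
needs_review = [sid for sid, (label, _) in resolved.items() if label is None]
```

Items listed in `needs_review` have no clear majority and would typically go through another round of validation rather than being discarded.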

🔄 Arabic–Russian Bilingual NLP – Mid‑Resource

15,801,992 parallel sentences (our flagship corpus)
Sources: OPUS, TED, Baranov dictionary, Borisov dictionary, Sahih al-Bukhari, Sahih Muslim, 40 Hadith, Quran (Kuliev), phrasebook, Tatoeba
Length correlation between Arabic and Russian sentences: 0.925
Applications: translation, cross‑lingual retrieval, bilingual lexicography
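
A length correlation like the one above can be computed as the Pearson correlation between the lengths of the two sides of each pair. The sketch below assumes whitespace tokenization and Pearson's r; the card does not state which length unit or coefficient produced the 0.925 figure, and the sample pairs are made up for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined when all lengths are equal")
    return cov / math.sqrt(vx * vy)

def length_correlation(pairs):
    """Correlate whitespace-token counts of the Arabic and Russian sides."""
    ar_lens = [len(ar.split()) for ar, _ in pairs]
    ru_lens = [len(ru.split()) for _, ru in pairs]
    return pearson(ar_lens, ru_lens)

# Toy sample, not taken from the corpus.
pairs = [
    ("السلام عليكم", "Мир вам"),
    ("أين المكتبة؟", "Где библиотека?"),
    ("أنا أدرس اللغة العربية في الجامعة", "Я изучаю арабский язык в университете"),
    ("ذهبت إلى السوق", "Я пошёл на рынок"),
]
r = length_correlation(pairs)
```

A value close to 1 suggests that long sentences tend to align with long sentences, which is a cheap sanity check on alignment quality; it does not by itself show that the translations are correct.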

🌍 Low‑Resource Pairs Involving Arabic – Low‑Resource

We focus on language pairs with minimal or no parallel data:

Pair	Resource Status	Our Work
Arabic ↔ Tatar	Very low	Data collection, transfer learning from Arabic–Russian + Russian–Tatar
Arabic ↔ Chechen	Extremely low	Zero‑shot translation via English or Russian pivot
Arabic ↔ Bashkir	Extremely low	Cross‑lingual embeddings
Arabic ↔ Hausa	Very low	Leveraging the NLLB‑200 multilingual model
Arabic ↔ Somali	Very low	Data collection and annotation
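
The pivot idea in the table can be sketched as a composition of two translators. In the sketch below, `pivot_translate`, `make_nllb_translator`, and the toy lookup translators are hypothetical names introduced for illustration; the NLLB wrapper assumes the `transformers` library and the public `facebook/nllb-200-distilled-600M` checkpoint, and the language codes should be checked against the model card before use.

```python
from typing import Callable

Translator = Callable[[str], str]

# FLORES-200 style codes used by NLLB-200 (verify against the model card).
NLLB_CODES = {
    "arabic": "arb_Arab",
    "russian": "rus_Cyrl",
    "english": "eng_Latn",
    "tatar": "tat_Cyrl",
    "bashkir": "bak_Cyrl",
    "hausa": "hau_Latn",
    "somali": "som_Latn",
}

def make_nllb_translator(src: str, tgt: str,
                         model_name: str = "facebook/nllb-200-distilled-600M") -> Translator:
    """Wrap a Hugging Face translation pipeline as a plain str -> str function."""
    # Imported lazily: requires `transformers` and downloads the model on first use.
    from transformers import pipeline
    pipe = pipeline("translation", model=model_name,
                    src_lang=NLLB_CODES[src], tgt_lang=NLLB_CODES[tgt])
    return lambda text: pipe(text)[0]["translation_text"]

def pivot_translate(text: str, src_to_pivot: Translator, pivot_to_tgt: Translator) -> str:
    """Translate source -> pivot -> target by chaining two translators."""
    return pivot_to_tgt(src_to_pivot(text))

def toy_translator(table: dict) -> Translator:
    """Dictionary lookup standing in for a real model in this demo."""
    return lambda text: table[text]

# Arabic -> Russian -> Tatar with toy one-word "models".
ar_to_ru = toy_translator({"كتاب": "книга"})
ru_to_tt = toy_translator({"книга": "китап"})
result = pivot_translate("كتاب", ar_to_ru, ru_to_tt)
```

With real models, the two toy translators would be replaced by, for example, `make_nllb_translator("arabic", "russian")` and `make_nllb_translator("russian", "tatar")`. Errors from the first step propagate into the second, which is the main cost of pivoting.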

🕌 Islamic Religious Texts – Domain‑Specific

We provide digitised, aligned, and machine‑readable versions of major Islamic texts:

Text	Description	Parallel Translation
The Quran	The holy book of Islam, 114 surahs	Russian (Elmir Kuliev), English (Sahih International)
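Sahih al-Bukhari	Widely regarded in Sunni tradition as the most authentic hadith collection (about 7,000 hadith, counting repetitions)	Russian translation
Sahih Muslim	Widely regarded as the second most authentic collection (about 7,000 hadith, counting repetitions)	Russian translation
40 Hadith of al-Nawawi	Concise collection of essential hadith (42 in the standard text, despite its title)	Russian translation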
Sunan Abu Dawud	One of the six major collections (Kutub al-Sittah)	Russian (in progress)
Jami` at-Tirmidhi	One of the six major collections	Russian (in progress)
Sunan an-Nasa'i	One of the six major collections	Russian (in progress)
Sunan Ibn Majah	One of the six major collections	Russian (in progress)

Applications:

Semantic search over hadith corpora
Question answering on Islamic texts
Classical Arabic morphological analysis
Cross‑collection hadith matching (e.g., finding the same hadith in Bukhari and Muslim)
Alignment of multiple translations for linguistic study
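
Cross‑collection matching usually starts with text normalization, since the same hadith may appear vocalized in one edition and unvocalized in another. The following is a minimal sketch assuming a simple pipeline (strip diacritics, unify letter variants, compare token sets with Jaccard similarity); the function names and the 0.8 threshold are hypothetical, and a production matcher would likely need fuzzier, order‑aware comparison.

```python
import re

# Harakat, tanwin, shadda, sukun, Quranic annotation marks,
# superscript alef, and tatweel.
_DIACRITICS = re.compile(r"[\u064B-\u065F\u0670\u06D6-\u06ED\u0640]")

def normalize_arabic(text: str) -> str:
    """Lossy normalization for matching, not for display."""
    text = _DIACRITICS.sub("", text)
    text = re.sub("[أإآٱ]", "ا", text)                 # unify alef variants
    text = text.replace("ى", "ي").replace("ة", "ه")    # common (lossy) folding
    text = re.sub(r"[^\u0621-\u064A\s]", " ", text)    # drop punctuation, digits
    return " ".join(text.split())

def jaccard(a: str, b: str) -> float:
    """Token-set overlap of two normalized texts, in [0, 1]."""
    ta, tb = set(normalize_arabic(a).split()), set(normalize_arabic(b).split())
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

def find_matches(query: str, candidates: dict, threshold: float = 0.8):
    """Return (candidate_id, score) pairs at or above the threshold, best first."""
    scored = [(cid, jaccard(query, text)) for cid, text in candidates.items()]
    return sorted((s for s in scored if s[1] >= threshold),
                  key=lambda s: s[1], reverse=True)

# Illustrative strings: the same phrase vocalized and unvocalized,
# plus an unrelated phrase. The ids are placeholders.
vocalized = "إِنَّمَا الْأَعْمَالُ بِالنِّيَّاتِ"
candidates = {"text-a": "إنما الأعمال بالنيات", "text-b": "الدين النصيحة"}
matches = find_matches(vocalized, candidates)
```

Folding ة into ه and ى into ي raises recall at the cost of occasionally merging distinct words, so the folded text should only be used as a matching key.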

📖 Lexicographic Resources

Arabic‑Russian Dictionary – Kh.K. Baranov (latest edition) – digitised and aligned
Russian‑Arabic Dictionary – V.M. Borisov (latest edition) – complements Baranov so that both translation directions are covered
Machine‑readable formats for NLP integration

📚 Educational Resources

We believe in open education and reproducible research. All our tutorials and teaching materials are freely available.

Interactive Notebooks – Arabic NLP, dialect processing, Arabic–Russian MT, low‑resource techniques (in Python, using Hugging Face libraries)
Video Lectures – Recorded talks on Arabic morphology, dialect identification, and Islamic text processing
Course Materials – Slides, readings, and assignments from our university courses
Blog Posts – Deep dives into challenges and solutions for Arabic dialects and low‑resource pairs

🤝 Get Involved

We welcome contributions from the community – researchers, developers, students, native speakers, dialect speakers, and Islamic scholars.

For Researchers

Use our models and datasets (and cite us!)
Collaborate on dialect annotation or low‑resource pair projects
Contribute new benchmarks for dialects or Arabic–Russian MT

For Developers

Integrate our models into translation, search, or chatbot applications
Report bugs or suggest improvements via GitHub Issues
Submit pull requests to our open‑source repositories

For Native & Dialect Speakers

Help us validate dialect annotations and translations
Share dialect texts (with permission) to enrich our data
Provide feedback on model outputs to reduce errors

For Islamic Scholars & Students

Help verify Quranic verse alignments and hadith translations
Suggest improvements for religious text processing
Use our tools for digital Islamic studies

For Students

Use our demos and tutorials for learning
Participate in our mentorship program or summer schools
Start your own research project with our support

📊 Corpus Highlights

Our flagship resource – the Arabic–Russian Translation Corpus:

Statistic	Value
Total pairs	15,801,992
Length correlation (Arabic vs. Russian sentence length)	0.925
Arabic tokens	357.7M
Russian tokens	366.0M
Unique Arabic tokens	1,848,317
Unique Russian tokens	933,467
Sources	OPUS, TED, Baranov, Borisov, Sahih al-Bukhari, Sahih Muslim, 40 Hadith, Quran (Kuliev), phrasebook, Tatoeba

Most frequent Arabic words: في (13.68M), من (8.45M), على (5.59M)

Most frequent Russian words: и (15.88M), в (15.52M), по (5.38M)
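
The figures above imply an average sentence length of roughly 22.6 Arabic and 23.2 Russian tokens per pair. The sketch below reproduces that arithmetic from the rounded table values and shows one way frequency lists like these could be produced; whitespace tokenization is an assumption, since the card does not say how tokens were counted, and the sample sentences are invented.

```python
from collections import Counter

# Rounded values copied from the table above.
TOTAL_PAIRS = 15_801_992
AR_TOKENS = 357.7e6
RU_TOKENS = 366.0e6

avg_ar = AR_TOKENS / TOTAL_PAIRS   # average Arabic tokens per sentence pair
avg_ru = RU_TOKENS / TOTAL_PAIRS   # average Russian tokens per sentence pair

def top_words(sentences, k=3):
    """Most frequent whitespace-separated tokens with their counts."""
    counts = Counter(tok for s in sentences for tok in s.split())
    return counts.most_common(k)

# Toy sample, not taken from the corpus.
sample = ["الكتاب في البيت", "خرج من البيت", "القلم على الكتاب في الحقيبة"]
frequent = top_words(sample)

print(f"avg Arabic tokens per pair:  {avg_ar:.2f}")
print(f"avg Russian tokens per pair: {avg_ru:.2f}")
```

On the full corpus the sentences would be read lazily from disk or a stream rather than held in a list.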

🌐 Connect With Us

🤗 Hugging Face: ArabicNLPWorld – Models, datasets, and spaces
📧 Contact: arabicnlpworld@example.com

🔄 Ecosystem Integration

Our work is integrated with the broader Hugging Face ecosystem:

Models on the Hub with easy‑to‑use pipelines
Datasets with streaming and evaluation scripts
Spaces for interactive demos and educational tools
Gradio apps for user‑friendly interfaces
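
As an illustration of the streaming workflow, the sketch below previews a few sentence pairs without downloading a full dataset. It assumes the `datasets` library; the repository id passed to `open_corpus_stream` and the column names `ar` and `ru` are placeholders, so check the dataset card for the real identifiers.

```python
from itertools import islice

def open_corpus_stream(repo_id: str, split: str = "train"):
    """Open a Hub dataset as a lazy iterable instead of downloading it."""
    # Imported lazily: requires the `datasets` package and network access.
    from datasets import load_dataset
    return load_dataset(repo_id, split=split, streaming=True)

def preview_pairs(stream, n=3, src_key="ar", tgt_key="ru"):
    """Take the first n rows from any iterable of dict-like rows."""
    return [(row[src_key], row[tgt_key]) for row in islice(stream, n)]

# Stand-in rows so the helper can be tried offline; with network access,
# replace this with open_corpus_stream("<org>/<dataset-name>").
fake_stream = iter([
    {"ar": "شكرا", "ru": "Спасибо"},
    {"ar": "مرحبا", "ru": "Привет"},
    {"ar": "نعم", "ru": "Да"},
    {"ar": "لا", "ru": "Нет"},
])
first_two = preview_pairs(fake_stream, n=2)
```

Because `preview_pairs` only consumes what it needs, the same helper works on a multi‑million‑row stream without loading it into memory.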

Empowering NLP for Modern Standard Arabic, Arabic dialects, low‑resource pairs, and Islamic texts through open science and community collaboration.