Tadabur: A Large-Scale Quran Audio Dataset

Abstract

Despite growing interest in Quranic data research, existing Quran datasets remain limited in both scale and diversity. To address this gap, we present Tadabur, a large-scale Quran audio dataset comprising more than 1,400 hours of recitation audio from over 600 distinct reciters, providing substantial variation in recitation styles, vocal characteristics, and recording conditions.

Tadabur covers 113 of the Qur'an's 114 surahs (all except Al-Fatiha), spanning styles such as murattal and mujawwad. Each audio file is accompanied by automatically derived word-level temporal alignments and structured metadata in a consistent JSON schema.

This diversity makes Tadabur a comprehensive and representative resource for Quranic speech research, enabling advances in ASR, tajwīd-aware modeling, reciter identification, and prosodic analysis. By significantly expanding both the total duration and variability of available Quran data, Tadabur aims to support future research and facilitate the development of standardized Quranic speech benchmarks.

Dataset Pipeline

A fully automated, multi-stage process transforms raw long-form recitations into clean, verse-level annotated audio files.

1. Collection: public Qur'an repositories and archives
2. LLM Metadata: surah and reciter extraction via LLM
3. WhisperX Align: word-level timestamps and the Ayah Alignment Module (AAM)
4. Boundary Detect: recitation-stop segmenter
5. Curation: ASR filtering and deduplication
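The curation stage is described only as "ASR filtering & deduplication", so the exact criterion is not stated here. A minimal sketch of one common way to do ASR-based filtering, assuming a segment is kept when the word error rate (WER) between the reference verse text and an ASR transcript stays under a threshold, might look like the following. The function names and the `max_wer=0.2` threshold are hypothetical choices for illustration, not values from the Tadabur pipeline.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token lists (single-row DP)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = cur
    return prev[-1]


def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        return 0.0 if not hyp else 1.0
    return edit_distance(ref, hyp) / len(ref)


def keep_segment(reference, hypothesis, max_wer=0.2):
    """Assumed filter rule: keep a segment whose WER is at most max_wer."""
    return word_error_rate(reference, hypothesis) <= max_wer
```

In practice the reference and hypothesis would likely need Arabic-specific normalization (for example, stripping diacritics) before comparison; that step is left out of this sketch.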

Dataset Statistics

Tadabur surpasses all prior publicly available Quranic datasets by a wide margin in scale, reciter diversity, and annotation richness.

Dataset	Samples	Reciters	Transcription	Word-Level Alignment
Quran Recitations (Kaggle)	6,689	12	✗	✗
Quran Speech-to-Text (SLR132)	226,129	30	✓	✗
Buraaq Quran Audio–Text	187,080	30	✓	✗
Tadabur (ours)	365,000+	600+	✓	✓

Figure: Number of reciters across Quranic datasets (Kaggle: 12; SLR132: 30; Buraaq: 30; Tadabur: 600+).

Audio Samples

Each verse-level file is paired with a structured JSON annotation containing word-level timestamps, metadata, and speaker information. Two representative samples are shown below.

أَفَلَا يَتَدَبَّرُونَ ٱلْقُرْءَانَ ۚ وَلَوْ كَانَ مِنْ عِندِ غَيْرِ ٱللَّهِ لَوَجَدُوا۟ فِيهِ ٱخْتِلَـٰفًا كَثِيرًا

Reciter ID: 88 · Surah: 3 · Ayah: 82 · Duration: 10.9 s

أَفَلَا يَتَدَبَّرُونَ ٱلْقُرْءَانَ أَمْ عَلَىٰ قُلُوبٍ أَقْفَالُهَآ

Reciter ID: 94 · Surah: 46 · Ayah: 24 · Duration: 5.5 s
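The JSON schema itself is not reproduced on this page, so the following is only a sketch of how such an annotation could be read, under an assumed layout. The field names (`reciter_id`, `surah`, `ayah`, `duration`, `words`, `text`, `start`, `end`) and the word timings are hypothetical placeholders; only the reciter, surah, ayah, and duration values echo the second sample above.

```python
import json

# Hypothetical annotation: field names and word timings are invented for
# illustration and may not match the actual Tadabur schema.
EXAMPLE_JSON = """
{
  "reciter_id": 94,
  "surah": 46,
  "ayah": 24,
  "duration": 5.5,
  "words": [
    {"text": "w1", "start": 0.0, "end": 1.2},
    {"text": "w2", "start": 1.2, "end": 2.9},
    {"text": "w3", "start": 3.1, "end": 5.5}
  ]
}
"""


def word_spans(annotation):
    """Return (word, start_seconds, end_seconds) tuples in order."""
    return [(w["text"], w["start"], w["end"]) for w in annotation["words"]]


def to_sample_range(start_s, end_s, sample_rate=16000):
    """Convert a time span in seconds to integer sample indices."""
    return round(start_s * sample_rate), round(end_s * sample_rate)


annotation = json.loads(EXAMPLE_JSON)
spans = word_spans(annotation)
```

With spans in this form, a single word can be cut out of a waveform array by slicing it with the indices from `to_sample_range`, assuming the audio is sampled at the rate passed in.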

Fine-Tuned Whisper Models

Alongside the dataset, we release Whisper models fine-tuned on Tadabur for Quranic ASR. These models are domain-adapted to handle prolonged phoneme durations, tajwīd rules, melodic articulation, and the wide acoustic diversity unique to Qur'anic recitation.

Tadabur-Whisper-Small

Base: Whisper Small

Lightweight fine-tuned model optimized for fast inference and embedded systems while maintaining meaningful Quranic ASR capability.

Available on Hugging Face

Tadabur-Whisper-Medium

Base: Whisper Medium

Coming Soon

Tadabur-Whisper-Large

Base: Whisper Large v3

Coming Soon
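Assuming the released checkpoints follow the standard Whisper format on the Hugging Face Hub, they could presumably be used through the `transformers` speech recognition pipeline. The sketch below shows that route; the Hub ID of the model is not given on this page, so `model_id` is left as a parameter for the reader to fill in, and both function names are hypothetical.

```python
def whisper_generate_kwargs(language="arabic"):
    """Decoding options that pin Whisper to transcription in one language."""
    return {"language": language, "task": "transcribe"}


def transcribe(audio_path, model_id):
    """Transcribe one audio file with a Whisper-style checkpoint.

    model_id is a placeholder for the checkpoint's Hub ID, which this
    page does not state.
    """
    # Imported lazily so the helper above works without transformers installed.
    from transformers import pipeline

    asr = pipeline("automatic-speech-recognition", model=model_id)
    result = asr(audio_path, generate_kwargs=whisper_generate_kwargs())
    return result["text"]
```

Calling `transcribe("ayah.wav", model_id)` with the Hub ID of Tadabur-Whisper-Small would then return the recognized text for a verse-level clip; the file name here is only an example.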

Citation

If you use Tadabur in your research, please cite:

BibTeX

@misc{alherran2026tadabur,
  author    = {Alherran, Faisal},
  title     = {Tadabur: A Large-Scale Quran Audio Dataset},
  year      = {2026},
  url       = {https://github.com/fherran/tadabur},
  note      = {HuggingFace: huggingface.co/datasets/FaisaI/tadabur}
}