Tadabur: A Large-Scale Quran Audio Dataset

The most comprehensive and richly annotated Qur'anic recitation corpus to date

Faisal Alherran

1400+ Hours of Audio
600+ Distinct Reciters
365,000+ Verse-level Files
113 Surahs Covered (without Al-Fatiha)

Abstract

Despite growing interest in Quranic data research, existing Quran datasets remain limited in both scale and diversity. To address this gap, we present Tadabur, a large-scale Quran audio dataset comprising more than 1,400 hours of recitation audio from over 600 distinct reciters, with substantial variation in recitation styles, vocal characteristics, and recording conditions.

Tadabur covers 113 of the Qur'an's 114 surahs (all except Al-Fatiha) and spans recitation styles such as murattal and mujawwad. Each file is accompanied by automatically derived word-level temporal alignments and structured metadata in a consistent JSON schema.

This diversity makes Tadabur a comprehensive and representative resource for Quranic speech research, supporting work on ASR, tajwīd-aware modeling, reciter identification, and prosodic analysis. By significantly expanding both the total duration and the variability of available Quran data, Tadabur aims to support future research and facilitate the development of standardized Quranic speech benchmarks.

Dataset Pipeline

A fully automated, multi-stage process transforms raw long-form recitations into clean, verse-level annotated audio files.

1. Collection: public Qur'an repositories and archives
2. LLM Metadata: surah and reciter extraction via an LLM
3. WhisperX Align: word-level timestamps and the Ayah Alignment Module (AAM)
4. Boundary Detect: recitation-stop segmenter
5. Curation: ASR filtering and deduplication
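
The ASR filtering part of the curation stage can be pictured with a minimal sketch. This is not the Tadabur implementation. It assumes each candidate segment carries a reference verse text and an ASR transcript, and that a segment is dropped when its word error rate (WER) exceeds a threshold. The field names `reference` and `asr_text`, the function names, and the 0.2 threshold are all hypothetical. A real pipeline for Arabic would probably normalize diacritics before comparing.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def word_error_rate(reference, hypothesis):
    """WER = word edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        return 0.0 if not hyp else 1.0
    return edit_distance(ref, hyp) / len(ref)


def filter_segments(segments, max_wer=0.2):
    """Keep segments whose ASR transcript stays close to the reference text.

    `segments` is a list of dicts with hypothetical keys
    "reference" and "asr_text"; `max_wer` is an assumed threshold.
    """
    return [
        s for s in segments
        if word_error_rate(s["reference"], s["asr_text"]) <= max_wer
    ]
```

With this sketch, a segment whose transcript matches the reference exactly has a WER of 0.0 and is kept, while a segment that drops one word out of four has a WER of 0.25 and is discarded at the assumed 0.2 threshold.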

Dataset Statistics

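Compared with prior publicly available Quranic datasets, Tadabur offers the most verse-level samples (365,000+) and far more reciters (600+, versus at most 30 in the datasets listed below), together with word-level alignment annotations.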

Dataset                          Samples     Reciters   Transcription   Word-Level Alignment
Quran Recitations (Kaggle)       6,689       12
Quran Speech-to-Text (SLR132)    226,129     30
Buraaq Quran Audio–Text          187,080     30
Tadabur (ours)                   365,000+    600+

Figure: Number of Reciters Across Quranic Datasets (Kaggle: 12, SLR132: 30, Buraaq: 30, Tadabur: 600+)

Audio Samples

Each verse-level file is paired with a structured JSON annotation containing word-level timestamps, metadata, and speaker information. Two representative samples are shown below.

أَفَلَا يَتَدَبَّرُونَ ٱلْقُرْءَانَ ۚ وَلَوْ كَانَ مِنْ عِندِ غَيْرِ ٱللَّهِ لَوَجَدُوا۟ فِيهِ ٱخْتِلَـٰفًا كَثِيرًا
Reciter ID: 88 · Surah 3 · Ayah 82 · Duration: 10.9s
أَفَلَا يَتَدَبَّرُونَ ٱلْقُرْءَانَ أَمْ عَلَىٰ قُلُوبٍ أَقْفَالُهَآ
Reciter ID: 94 · Surah 46 · Ayah 24 · Duration: 5.5s
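
To make the annotation format more concrete, here is a minimal sketch of reading one annotation and sanity-checking its word-level timestamps. The actual Tadabur JSON schema is not reproduced on this page, so every field name (`reciter_id`, `surah`, `ayah`, `duration`, `words`, `start`, `end`) is an assumption. The word entries and their timestamps are invented placeholders. Only the reciter ID, surah, ayah, and duration are taken from the second sample above.

```python
import json

# Hypothetical annotation: field names and word timings are placeholders,
# not the real Tadabur schema or real alignment output.
SAMPLE = """
{
  "reciter_id": 94,
  "surah": 46,
  "ayah": 24,
  "duration": 5.5,
  "words": [
    {"word": "w1", "start": 0.0, "end": 1.2},
    {"word": "w2", "start": 1.4, "end": 2.9},
    {"word": "w3", "start": 3.1, "end": 5.3}
  ]
}
"""


def word_durations(annotation):
    """Return (word, duration in seconds) pairs, rounded to milliseconds."""
    return [
        (w["word"], round(w["end"] - w["start"], 3))
        for w in annotation["words"]
    ]


def check_alignment(annotation):
    """True if word timestamps are ordered and fit inside the file duration."""
    prev_end = 0.0
    for w in annotation["words"]:
        if w["start"] < prev_end or w["end"] < w["start"]:
            return False
        prev_end = w["end"]
    return prev_end <= annotation["duration"]


annotation = json.loads(SAMPLE)
print(word_durations(annotation))
print(check_alignment(annotation))
```

A check like `check_alignment` is one cheap way to catch overlapping or out-of-range timestamps before the alignments are used for training or analysis.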

Fine-Tuned Whisper Models

Alongside the dataset, we release Whisper models fine-tuned on Tadabur for Quranic ASR. These models are domain-adapted to handle prolonged phoneme durations, tajwīd rules, melodic articulation, and the wide acoustic diversity unique to Qur'anic recitation.

Tadabur-Whisper-Small
Base: Whisper Small
Status: available on Hugging Face
Lightweight fine-tuned model optimized for fast inference and embedded systems while maintaining meaningful Quranic ASR capability.

Tadabur-Whisper-Medium
Base: Whisper Medium
Status: coming soon

Tadabur-Whisper-Large
Base: Whisper Large v3
Status: coming soon

Citation

If you use Tadabur in your research, please cite:

BibTeX
@misc{alherran2026tadabur,
  author = {Alherran, Faisal},
  title  = {Tadabur: A Large-Scale Quran Audio Dataset},
  year   = {2026},
  url    = {https://github.com/fherran/tadabur},
  note   = {HuggingFace: huggingface.co/datasets/FaisaI/tadabur}
}