We are proud to announce the availability of the Spoken Wikipedia
Corpora. They consist of time-aligned spoken Wikipedia articles for
English, German, and Dutch, totaling more than 1,000h of audio by
numerous speakers about a wide variety of topics.
The Spoken Wikipedia Project is a volunteer community effort within
Wikipedia to record spoken versions of Wikipedia articles.
The corpora are freely available under a CC BY-SA license.
The annotations perfectly retain the original text and each part can
be traced back to its original place in the Wikipedia HTML. This
enables research with respect to spoken hypertext and its markup.
In addition to per-word alignments, we also provide phoneme-level
alignments for both German and English, generated by MAUS.
               DE      EN      NL
#articles      1010    1314    3073
hours audio    386h    395h    224h
aligned        249h    182h    79h
ph-aligned     129h    77h     -
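To put these figures in proportion, the share of recorded audio that is word-aligned can be worked out per language. The following is only a small sketch: the hour values are copied from the table above (rounded as published), and the names corpus_hours and aligned_share are made up for this illustration and are not part of the corpus tooling.

```python
# Hours copied from the table above (rounded as published).
# The variable and function names are illustrative assumptions.
corpus_hours = {
    "DE": {"audio": 386, "aligned": 249},
    "EN": {"audio": 395, "aligned": 182},
    "NL": {"audio": 224, "aligned": 79},
}


def aligned_share(stats):
    """Return the fraction of recorded audio that is word-aligned."""
    return stats["aligned"] / stats["audio"]


# Totals across all three languages.
total_audio = sum(s["audio"] for s in corpus_hours.values())
total_aligned = sum(s["aligned"] for s in corpus_hours.values())

for lang, stats in corpus_hours.items():
    print(f"{lang}: {aligned_share(stats):.1%} of {stats['audio']}h aligned")
print(f"all: {total_aligned}h of {total_audio}h aligned")
```

With the rounded table values, roughly two thirds of the German audio, a little under half of the English audio, and about a third of the Dutch audio is aligned.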
This is, as far as we know, the largest corpus of freely-available
aligned speech for both German and Dutch and the largest
freely-available corpus of aligned factual speech for English.
To download the corpora and obtain more information, please visit
There you can also obtain the pipeline to automatically download and
align a Spoken Wikipedia Corpus yourself. We also provide a template
to adapt the software for new languages. Note, however, that the
alignment process takes a significant amount of time.
If you use this resource in your research, please cite
Arne Köhn, Florian Stegen, Timo Baumann. 2016.
«Mining the Spoken Wikipedia for Speech Data and Beyond».
In Proceedings of LREC 2016.
or a later publication.