Magichub - Awesome Audio and Text Corpus Collections

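Magichub is a community site built by Magic Data, a data service provider in the artificial intelligence field, to offer its own free, open-source datasets to the whole industry.

Magichub has open-sourced 68 datasets so far, and the collection is still being updated.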

Competitions

MagicData-RAMC-Challenge | MagicData Heavily Accented Chinese Challenge

Overview | Datasets for Training | Baseline
PS: You must be logged in to an account to download the datasets for free.

Datasets

Datasets License

Magic Data Open-Source License

Mandarin Chinese Scripted Speech Corpus – Keyword Spotting

This open-source dataset consists of 6 hours of transcribed Mandarin Chinese scripted speech for keyword spotting, recorded at fast, normal, and slow speeds. It contains 11,030 utterances from 37 speakers.

English Conversational Speech Corpus – Telephony

This open-source dataset consists of 5.04 hours of transcribed English conversational speech recorded over the telephone channel. It contains 13 conversations (the Chinese description of this dataset gives 10).

Chinese English Scripted Speech Corpus – Children

This open-source dataset consists of 1.44 hours of transcribed scripted English speech from Chinese children. It contains 2,266 utterances from ten speakers aged 7 or younger.

Pakistani English Scripted Speech Corpus – Daily Use Sentence

This open-source dataset consists of 4 hours of transcribed Pakistani English scripted speech focusing on daily use sentences. It contains 2,191 utterances from seven speakers.

French Audio Datasets

French Conversational Speech Corpus

This open-source dataset consists of 1.1 hours of transcribed French conversational speech on certain topics. It contains six conversations between two speakers.

Korean Audio Datasets

Korean Conversational Speech Corpus

This open-source dataset consists of 5.22 hours of transcribed Korean conversational speech on given topics. It contains 22 conversations between seven pairs of speakers.

German Audio Datasets

German Conversational Speech Corpus

This open-source dataset consists of 6.55 hours of transcribed German conversational speech on specific topics. It contains 10 conversations between two pairs of speakers.

German Scripted Speech Corpus – Command and Query

This open-source dataset consists of 0.71 hours of transcribed German scripted speech focusing on commands and queries. It contains 597 utterances from ten speakers.

Japanese Audio Datasets

Japanese Scripted Speech Corpus – Daily Use Sentence

This open-source dataset consists of 18 hours of transcribed Japanese scripted speech focusing on daily use sentences. It contains 17,372 utterances from 37 speakers.

Italian Audio Datasets

Italian Scripted Speech Corpus – Command and Query

This open-source dataset consists of 0.9 hours of transcribed Italian scripted speech focusing on commands and queries. It contains 982 utterances from ten speakers.

Italian Conversational Speech Corpus

This open-source dataset consists of 10.43 hours of transcribed Italian conversational speech on given topics. It contains 28 conversations between three pairs of speakers.

Spanish Audio Datasets

Spanish Conversational Speech Corpus

This open-source dataset consists of 5.56 hours of transcribed Peninsular Spanish conversational speech on given topics. It contains 17 conversations between four pairs of speakers.

American Spanish Scripted Speech Corpus – Daily Use Sentence

This open-source dataset consists of 4.08 hours of transcribed American Spanish scripted speech focusing on daily use sentences. It contains 5,159 utterances from ten speakers.

Russian Audio Datasets

Russian Scripted Speech Corpus – Daily Use Sentence

This open-source dataset consists of 6.57 hours of transcribed Russian scripted speech focusing on daily use sentences. It contains 3,842 utterances from ten speakers.

Indonesian Audio Datasets

Indonesian Conversational Speech Corpus

This open-source dataset consists of 4.54 hours of transcribed Indonesian conversational speech on specific topics. It contains seven conversations between two pairs of speakers.

Indonesian Scripted Speech Corpus – Daily Use Sentence

This open-source dataset consists of 3.5 hours of transcribed Indonesian scripted speech focusing on daily use sentences. It contains 3,296 utterances from ten speakers.
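
Because the scripted speech entries in this list give both a total duration and an utterance count, a rough average utterance length can be derived from them. The following is a minimal sketch of that arithmetic, not part of any Magichub tooling: the names `SCRIPTED_CORPORA` and `mean_utterance_seconds` are illustrative, and the figures are copied from a few of the descriptions in this README.

```python
# Illustrative only: (total hours, utterance count) copied from the
# dataset descriptions in this README. The dictionary and function
# names are hypothetical and not part of any Magichub API.
SCRIPTED_CORPORA = {
    "Mandarin Chinese - Keyword Spotting": (6.0, 11030),
    "Pakistani English - Daily Use Sentence": (4.0, 2191),
    "Japanese - Daily Use Sentence": (18.0, 17372),
    "Russian - Daily Use Sentence": (6.57, 3842),
    "Indonesian - Daily Use Sentence": (3.5, 3296),
}


def mean_utterance_seconds(hours: float, utterances: int) -> float:
    """Return the average utterance length in seconds.

    The result includes any silence counted in the reported total duration.
    """
    if utterances <= 0:
        raise ValueError("utterances must be positive")
    return hours * 3600 / utterances


if __name__ == "__main__":
    for name, (hours, utterances) in SCRIPTED_CORPORA.items():
        seconds = mean_utterance_seconds(hours, utterances)
        print(f"{name}: {seconds:.2f} s per utterance")
```

A comparison like this can help when choosing a corpus: short averages suggest keyword-style or command-style material, while longer ones suggest full sentences. The reported hours are rounded, so the results are approximate.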

English Text Datasets

Chinese-English Parallel Corpus of Ice & Snow Sports News

This dataset contains 100 news items.

Chinese-English Parallel Corpus – Finance

This open-source dataset consists of one hundred Chinese-English parallel sentences, translated from Chinese into English, covering finance-related daily use sentences.

English Customer Service Scenario Text Corpus – Healthcare

This open-source dataset consists of 50 question-and-answer dialogues in English text, set in healthcare-related customer service scenarios.

Korean Text Datasets

Korean Text Corpus – Command and Query

This open-source dataset consists of one hundred Korean command and query sentences for smart-home control.

Japanese Text Datasets

Japanese Text Corpus – Command and Query

This open-source dataset consists of one hundred Japanese command and query sentences for smart-home control.

Magic Data Proprietary Datasets

Contact us if you need more training datasets for ML. [email protected]

magichub-opensource/Magichub-Awesome-Datasets-and-Competitions
