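Magichub is a community site built by Magic Data, a data service provider in the artificial intelligence field, to offer its own datasets to the whole industry free of charge and as open source.
Magichub has open-sourced 68 datasets so far, and the collection is still being updated.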
PS: You must be logged in to an account to download datasets for free.
Magic Data Open-Source License
This open-source dataset consists of 6 hours of transcribed Mandarin Chinese scripted speech for keyword spotting at fast, normal, and slow speeds, comprising 11,030 utterances from 37 speakers.
This open-source dataset consists of 5.04 hours of transcribed English conversational speech beyond telephony, comprising 13 conversations.
This dataset contains 5.04 hours of English telephone-channel conversational audio with transcripts, comprising 10 conversations.
This open-source dataset consists of 1.44 hours of transcribed Chinese English scripted speech from children, comprising 2,266 utterances from ten speakers aged 7 or younger.
This open-source dataset consists of 4 hours of transcribed Pakistani English scripted speech focusing on daily use sentences, comprising 2,191 utterances from seven speakers.
This open-source dataset consists of 1.1 hours of transcribed French conversational speech on certain topics, comprising six conversations between two speakers.
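This dataset contains 1.1 hours of French conversational audio with transcripts, comprising 6 free-form conversations between 2 groups of speakers.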
This open-source dataset consists of 5.22 hours of transcribed Korean conversational speech on given topics, comprising 22 conversations between seven pairs of speakers.
This open-source dataset consists of 6.55 hours of transcribed German conversational speech on specific topics, comprising 10 conversations between two pairs of speakers.
This open-source dataset consists of 0.71 hours of transcribed German scripted speech focusing on commands and queries, comprising 597 utterances from ten speakers.
This open-source dataset consists of 18 hours of transcribed Japanese scripted speech focusing on daily use sentences, comprising 17,372 utterances from 37 speakers.
This open-source dataset consists of 0.9 hours of transcribed Italian scripted speech focusing on commands and queries, comprising 982 utterances from ten speakers.
This open-source dataset consists of 10.43 hours of transcribed Italian conversational speech on given topics, comprising 28 conversations between three pairs of speakers.
This open-source dataset consists of 5.56 hours of transcribed Peninsular Spanish conversational speech on given topics, comprising 17 conversations between four pairs of speakers.
This open-source dataset consists of 4.08 hours of transcribed American Spanish scripted speech focusing on daily use sentences, comprising 5,159 utterances from ten speakers.
This open-source dataset consists of 6.57 hours of transcribed Russian scripted speech focusing on daily use sentences, comprising 3,842 utterances from ten speakers.
This open-source dataset consists of 4.54 hours of transcribed Indonesian conversational speech on specific topics, comprising seven conversations between two pairs of speakers.
This open-source dataset consists of 3.5 hours of transcribed Indonesian scripted speech focusing on daily use sentences, comprising 3,296 utterances from ten speakers.
This dataset contains 100 news items.
This open-source dataset is a Chinese-English parallel corpus of one hundred sentences, translated from Chinese to English, covering finance-related daily use sentences.
This open-source dataset consists of 50 question-and-answer dialogues in English text, covering healthcare-related customer service scenarios.
This open-source dataset consists of one hundred text sentences of commands and queries in Korean.
This open-source dataset consists of one hundred text sentences of commands and queries in Japanese.
Contact us if you need more training datasets for ML. [email protected]