Datasets for benchmark evaluation #93

Open
wanghaisheng opened this issue Apr 23, 2018 · 18 comments

Comments

@wanghaisheng
Owner

No description provided.

@wanghaisheng
Owner Author

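Medical records

* A dataset collected from mutual-aid platforms, for evaluating text localization and recognition on photos taken with mobile phones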
https://github.com/wanghaisheng/huzhucases

@wanghaisheng
Owner Author

www.icst.pku.edu.cn/cpdp/data/marmot_data.htm
Dataset for table recognition
In total, 2000 pages in PDF format were collected and the corresponding ground-truths were extracted utilizing our semi-automatic ground-truthing tool "Marmot".
The dataset is composed of Chinese and English pages at the proportion of about 1:1.

The Chinese pages were selected from over 120 e-Books with diverse subject areas provided by Founder Apabi library, and no more than 15 pages were selected from each book.
The English pages were crawled from Citeseer website.

The pages show a great variety in language type, page layout, and table styles. Among them, over 1500 conference and journal papers were crawled, covering various fields and spanning publications from 1970 to 2011.
The e-Book pages are mostly in one-column layout, while the English pages are mixed with both one-column and two-column layouts.

@wanghaisheng
Owner Author

Open Images dataset and challenge:

https://storage.googleapis.com/openimages/web/index.html

@wanghaisheng
Owner Author

https://github.com/cs-chan/Total-Text-Dataset

Total Text Dataset - ICDAR 2017. It consists of 1555 images with more than 3 different text orientations: Horizontal, Multi-Oriented, and Curved, one of a kind.

@wanghaisheng
Owner Author

CTW dataset: https://ctwdataset.github.io/
In this paper we provide details of a newly created dataset of Chinese text with about 1 million Chinese characters annotated by experts in over 30 thousand street view images. This is a challenging dataset with good diversity. It contains planar text, raised text, text in cities, text in rural areas, text under poor illumination, distant text, partially occluded text, etc. For each character in the dataset, the annotation includes its underlying character, its bounding box, and 6 attributes. The attributes indicate whether it has complex background, whether it is raised, whether it is handwritten or printed, etc.

32,285 high resolution images
1,018,402 character instances
3,850 character categories
6 kinds of attributes

@wanghaisheng
Owner Author

wanghaisheng commented May 15, 2018

http://rrc.cvc.uab.es/?com=introduction
"Robust Reading" refers to the research area dealing with the interpretation of written communication in unconstrained settings. Typically Robust Reading is linked to the detection and recognition of textual information in scene images, but in the wider sense it refers to techniques and methodologies that have been developed specifically for text containers other than scanned paper documents, and include born-digital images and videos to mention a few.

Robust Reading is at the meeting point between camera based document analysis and scene interpretation, and serves as common ground between the document analysis community and the wider computer vision community.

The ICDAR Robust Reading Competition has been held five times [1-5], in 2003, 2005, 2011, 2013 and 2015. The competition is organized around challenges that represent specific application domains for robust reading. Challenges are selected to cover a wide range of real-world situations. Each challenge is set up around different tasks.

ICDAR2017

@wanghaisheng
Owner Author

The Text Recognition Algorithm Independent Evaluation (TRAIT)
https://nvlpubs.nist.gov/nistpubs/ir/2017/NIST.IR.8199.pdf


@wanghaisheng
Owner Author

Link: https://pan.baidu.com/s/12Wstdz_u8iwr7NEJGQtnZg Password: 7p2m
HWDB2.2 handwriting data in VOC format; help yourself if you need it.

@mvprasad58

In the Marmot dataset, the table bounding boxes (BBOX) do not match the original images.

@cloudfool

I'd like to ask: are there any text-line datasets in Chinese or English? Something like the synthetically generated ones from caffe-ocr.

@wanghaisheng
Owner Author

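@cloudfool Everyone builds their own, applying existing generation tools to the scenario they actually need to handle.
For real-world scenes there is quite a lot of English data and relatively little Chinese data, but you can generate it from other sources (for example, if what you process is paper-style documents).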

@cloudfool

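Which English text-line datasets are open source? I have looked at many, and they are all word-level (ICDAR, for example); what I want is sentence-level.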

@wanghaisheng
Owner Author

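@cloudfool Have you gone through everything I listed above?
https://github.com/NVlabs/ocroseg/tree/master/testdata
As for sentence-level data: what kind of sentences do you need? You can generate them directly from Project Gutenberg e-books (txt files of novels, poetry, and so on), using numpy and similar tools.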
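
A minimal sketch of the text side of that suggestion, assuming a plain-text corpus such as a Project Gutenberg txt file. The function name `sample_text_lines` and its parameters are hypothetical and not from this thread. It only produces the line-level labels; rendering each label to an image would still need an image library and a font, which are left out here.

```python
import random
import re
import textwrap


def sample_text_lines(corpus, n_lines, max_chars=40, seed=0):
    """Pick up to n_lines line-sized snippets from a plain-text corpus."""
    # Collapse all whitespace so line breaks in the source file do not matter.
    flat = re.sub(r"\s+", " ", corpus).strip()
    # Wrap on word boundaries; each wrapped line is one candidate label.
    candidates = textwrap.wrap(flat, width=max_chars)
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    rng.shuffle(candidates)
    return candidates[:n_lines]


# Stand-in corpus (assumption); in practice, read the text of a txt e-book.
corpus = """It was the best of times, it was the worst of times,
it was the age of wisdom, it was the age of foolishness."""

lines = sample_text_lines(corpus, n_lines=3, max_chars=30)
for line in lines:
    print(line)
```

Word-boundary wrapping suits English; Chinese text has no spaces between words, so a fixed-length character slice would probably be a better fit there.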

@mttbx

mttbx commented Jun 9, 2019

@wanghaisheng Hi, I sent an email to the 163 mailbox shown on your GitHub profile. I need your help, brother!

@wanghaisheng
Owner Author

@mttbx I can't find the original files anymore.

@LinnaWang76

Link: https://pan.baidu.com/s/12Wstdz_u8iwr7NEJGQtnZg Password: 7p2m
HWDB2.2 handwriting data in VOC format; help yourself if you need it.

Brother, the link has expired!

@wanghaisheng
Owner Author

@LinnaWang76 Sorry, I have forgotten the file name, so I can't find the file in my Baidu Pan to share it again.

@chixma

chixma commented Nov 21, 2019

In the Marmot dataset, the table bounding boxes (BBOX) do not match the original images.

I am facing the same issue. Have you found a solution since?
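
The thread does not confirm the cause of this mismatch. One common source of this kind of offset, offered here only as an assumption, is that ground-truth boxes are stored in PDF page units with the origin at the bottom-left, while the rendered page image uses pixels with the origin at the top-left. The sketch below shows that conversion; the function name `pdf_box_to_pixels` and the page and image sizes are hypothetical, and whether it applies to Marmot has to be checked against the data.

```python
def pdf_box_to_pixels(box, page_w_pt, page_h_pt, img_w_px, img_h_px):
    """Map (x0, y0, x1, y1) from PDF points (origin bottom-left)
    to pixel coordinates (origin top-left) as (left, top, right, bottom)."""
    x0, y0, x1, y1 = box
    # Scale factors from page units to pixels.
    sx = img_w_px / page_w_pt
    sy = img_h_px / page_h_pt
    # Horizontal axis: scale only.
    left, right = sorted((x0 * sx, x1 * sx))
    # Vertical axis: flip around the page height, then scale.
    top, bottom = sorted(((page_h_pt - y0) * sy, (page_h_pt - y1) * sy))
    return (round(left), round(top), round(right), round(bottom))


# Assumed example: a US Letter page (612 x 792 pt) rendered at 2x scale.
pixel_box = pdf_box_to_pixels((100, 600, 300, 700), 612, 792, 1224, 1584)
print(pixel_box)  # (200, 184, 600, 384)
```

If the boxes still look wrong after this mapping, the stored values may use a different unit or encoding than assumed here.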
