Space characters cannot be recognized. #2

Open
magneter opened this issue Nov 20, 2018 · 6 comments

Comments

@magneter

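There are no spaces between English words in the output. Recognized English paragraphs come out with all the words run together. Does the author have a way to fix this?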

@white2018

How about adding the space to the character set as well? You would also need to prepare a matching dataset.

@GlassyWing
Owner

For this case:

  1. Append a space character to the dictionary file char_std_5990.txt, which is located in dlocr/dictionary.
  2. In the config file desenet-default.json, which is located in dlocr/config, set 'num_classes' to 5991.
  3. Use a tool such as https://github.com/Sanster/text_renderer to generate your own dataset. The source text must contain spaces, and the generated images should have size (32, 280). That size is the default, but you can change it.
    Note: the number of words in one image cannot exceed 50 (you can change this limit as well).
  4. Train the model.
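
Steps 1 and 2 above can be sketched in Python as follows. This is only a minimal sketch, not code from the repository. It assumes the dictionary file holds one character per line and that the config is a JSON object with a top-level 'num_classes' key. The helper names `append_space_to_dictionary` and `set_num_classes` are hypothetical.

```python
import json
from pathlib import Path


def append_space_to_dictionary(dict_path):
    """Append a space entry to a one-character-per-line dictionary file.

    Assumes one character per line (an assumption about the file format).
    Returns the number of entries after the update.
    """
    path = Path(dict_path)
    text = path.read_text(encoding="utf-8")
    # Split on "\n" only, so a line holding a single space is preserved.
    entries = text.split("\n")
    if entries and entries[-1] == "":
        entries.pop()  # drop the empty piece after the final newline
    if " " not in entries:
        entries.append(" ")
    path.write_text("\n".join(entries) + "\n", encoding="utf-8")
    return len(entries)


def set_num_classes(config_path, num_classes):
    """Set the top-level 'num_classes' key of a JSON config file."""
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    config["num_classes"] = num_classes
    path.write_text(
        json.dumps(config, indent=2, ensure_ascii=False), encoding="utf-8"
    )
    return config
```

With the paths from the steps above, this would be called as `append_space_to_dictionary("dlocr/dictionary/char_std_5990.txt")` and `set_num_classes("dlocr/config/desenet-default.json", 5991)`. If the code that loads the dictionary strips whitespace from each line, the space entry would be lost, so it is worth checking the loader before training.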

@magneter
Author

Thank you very much for the patient answer. Would it give better results to train a separate, dedicated CTC model for English?

@magneter
Author

If Chinese and English are trained in the same model, the generalization accuracy might drop, right?

@GlassyWing
Owner

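Training a dedicated English model may work better, but even if both languages are trained in the same model the results will not be bad, because DenseNet generalizes well. You can generate a training set that contains both Chinese and English text with spaces to improve generalization.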
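
One way to prepare such mixed source text, before passing it to an image generator, might look like the sketch below. It is an illustration only and does not use any text_renderer API. The function name `make_mixed_lines`, the sample phrases in the usage note, and the reading of the limit of 50 as a character count are all assumptions.

```python
import random


def make_mixed_lines(chinese_phrases, english_words, n_lines, max_chars=50, seed=0):
    """Build text lines that mix Chinese phrases and English words.

    Pieces are joined with single spaces, and each line is kept within
    max_chars characters (the 50 default is an assumed character limit).
    """
    rng = random.Random(seed)  # fixed seed so the output is reproducible
    lines = []
    for _ in range(n_lines):
        pieces = []
        length = 0
        while True:
            # Pick either a Chinese phrase or an English word at random.
            pool = chinese_phrases if rng.random() < 0.5 else english_words
            piece = rng.choice(pool)
            # Account for the separating space before every piece but the first.
            extra = len(piece) + (1 if pieces else 0)
            if length + extra > max_chars:
                break
            pieces.append(piece)
            length += extra
        lines.append(" ".join(pieces))
    return lines
```

For example, `make_mixed_lines(["你好", "识别"], ["hello", "world"], 1000)` returns 1000 lines that could be written to a plain-text corpus file, one line per image. Every character that appears in these lines, including the space, must also be present in the dictionary file.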
