Customised-OCR-Correction

A word-list based OCR post-correction, originally designed for historical medical text.

This is a re-implementation of Thompson et al.'s algorithm for OCR post-correction. The implementation is built on the spell checker "pyspellchecker". "Customised" refers to the fact that a suggested correction is only used if its frequency in the OCR:ed corpus reaches a cut-off. The default cut-off is 2, i.e. the suggested word must occur at least twice in the corpus.
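The cut-off idea can be sketched as follows. This is a minimal illustration, not the code in this repository: it uses `difflib.get_close_matches` from the Python standard library as a stand-in for the candidate generation that pyspellchecker provides, and the function name `correct_token`, the parameter `min_occurrences` and the toy corpus are assumptions made for the example.

```python
from collections import Counter
from difflib import get_close_matches


def correct_token(token, known_words, corpus_freq, min_occurrences=2):
    """Return a replacement for token, or token itself if no suggestion qualifies."""
    if token in known_words:
        return token
    # Stand-in for the spell checker: known words that are similar to the token.
    candidates = get_close_matches(token, sorted(known_words), n=5, cutoff=0.75)
    # Keep only suggestions that occur often enough in the OCR:ed corpus itself.
    frequent = [c for c in candidates if corpus_freq[c] >= min_occurrences]
    if not frequent:
        return token
    # Among the remaining suggestions, prefer the most frequent one.
    return max(frequent, key=lambda c: corpus_freq[c])


corpus_tokens = (
    "the patient has fever the patient has a cough the patiemt has fever"
).split()
corpus_freq = Counter(corpus_tokens)
known_words = {"the", "patient", "has", "fever", "and", "a", "cough", "patent"}

corrected = [correct_token(t, known_words, corpus_freq) for t in corpus_tokens]
print(" ".join(corrected))
```

In this toy corpus, "patiemt" is replaced because the suggestion "patient" occurs twice, while the equally plausible word-list entry "patent" is never suggested because it does not occur in the corpus at all.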

The algorithm targets OCR:ed corpora in which the OCR output is already of high quality, but where it is still relevant to correct some remaining errors.

User-made word lists can be used for the correction. The spellchecker only has built-in word lists for a few languages, so for most languages, user-made word lists are required.

Note that the corpus is currently expected to be organised as one main folder with subfolders, which in turn contain ".txt" files, as in the example folder nonsense-texts.
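A minimal sketch of walking such a folder structure is shown below. The function name `iter_corpus_files` and the demo folder names are hypothetical, and the sketch assumes that only ".txt" files placed directly inside a subfolder are read; the actual code may treat other layouts differently.

```python
import tempfile
from pathlib import Path


def iter_corpus_files(main_folder):
    """Yield the .txt files found in the subfolders of main_folder."""
    for subfolder in sorted(Path(main_folder).iterdir()):
        if subfolder.is_dir():
            yield from sorted(subfolder.glob("*.txt"))


# Build a tiny throw-away corpus to show which files are picked up.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "my-corpus"
    (root / "volume-1").mkdir(parents=True)
    (root / "volume-1" / "page-1.txt").write_text("some text", encoding="utf-8")
    # Assumption: a file directly in the main folder is not part of the corpus.
    (root / "notes.txt").write_text("not in a subfolder", encoding="utf-8")
    found = [path.name for path in iter_corpus_files(root)]

print(found)
```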

There are some additions to the original algorithm:

a) A word is not replaced if the frequency of the original word in the corpus is higher than the frequency for the spellchecker's suggestion for replacement

b) Compound splitting is also added to the spell checker. Which compound splitter to use is configurable: you can either write your own or use an existing one. This makes it possible to adapt the compound splitting to the language of the text, and to choose a compound splitter that is more or less generous in flagging words as correct.

c) The algorithm also attempts to locate words that are written with white space between the characters, and changes these to words in which the characters are not separated by white space.

d) Error statistics are also calculated and plotted.
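Additions a) to c) can be sketched with the standard library alone. All names below (`keep_original`, `is_known`, `naive_splitter`, `join_spaced_letters`) are hypothetical and only illustrate the ideas; in particular, `naive_splitter` is a toy stand-in for whichever compound splitter is configured, not the one used in this repository.

```python
import re
from collections import Counter


def keep_original(word, suggestion, corpus_freq):
    """Addition a): keep the word if it is more frequent than the suggestion."""
    return corpus_freq[word] > corpus_freq[suggestion]


def naive_splitter(word, known_words, min_part=3):
    """Toy splitter: the first two-way split in which both parts are known."""
    for i in range(min_part, len(word) - min_part + 1):
        if word[:i] in known_words and word[i:] in known_words:
            return [word[:i], word[i:]]
    return []


def is_known(word, known_words, split_compound):
    """Addition b): accept a word if the pluggable splitter finds known parts."""
    if word in known_words:
        return True
    parts = split_compound(word)
    return bool(parts) and all(part in known_words for part in parts)


# Three or more single letters, each separated by exactly one space.
SPACED = re.compile(r"\b(?:[^\W\d_] ){2,}[^\W\d_]\b")


def join_spaced_letters(text, known_words):
    """Addition c): merge 'f e v e r' into 'fever' if the result is known."""
    def merge(match):
        joined = match.group(0).replace(" ", "")
        # Limitation of this sketch: a neighbouring one-letter word is
        # absorbed into the match, and the run is then left unchanged.
        return joined if joined in known_words else match.group(0)
    return SPACED.sub(merge, text)


known = {"fever", "head", "ache", "the"}
freq = Counter({"insulin": 4, "insulins": 1})

print(keep_original("insulin", "insulins", freq))
print(is_known("headache", known, lambda w: naive_splitter(w, known)))
print(join_spaced_letters("had f e v e r today", known))
```

Passing the splitter in as a function is one way to keep it configurable: a stricter or more generous splitter can be swapped in without touching the rest of the code.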

Programming libraries needed

pip install pyspellchecker

conda install -c anaconda nltk

conda install -c conda-forge matplotlib

(Read more about the spell checker: https://pyspellchecker.readthedocs.io)

Acknowledgements

This work is part of the research project Acting out Disease: How Patient Organizations Shaped Modern Medicine (ActDisease). More information about the project can be found here: https://www.actdisease.org/

References

Thompson, P., McNaught, J. and Ananiadou, S. (2015) ‘Customised OCR correction for historical medical text’, in 2015 Digital Heritage. 2015 Digital Heritage, Granada, Spain: IEEE, pp. 35–42. Available at: https://doi.org/10.1109/DigitalHeritage.2015.7413829.

How we have used the code in practice

1. Gather relevant word lists, e.g. from corpora, lexicons and terminologies. Use one word per row, as in the file example_word_lists/some_swedish_words.txt
2. List the word-list files in a file, one list per row, as in demo-word-lists.txt
3. Choose the spell checker configuration, e.g. which compound splitter to use, the maximum edit distance, and the minimum number of occurrences in the OCR:ed corpus.
4. Run on the corpus with the chosen configuration. A number of output files will be created. In the created folder with the suffix '_statistics', there is a file starting with 'not_found'.
5. Manually go through the output file 'not_found', and add frequent unknown words to the word lists of known words. Some of the unknown words will likely be OCR misinterpretations of common words that are missing from the word lists. (The unknown words are ordered by frequency.)
6. Run the correction functionality once more. This time, go through the file 'replacements_made' and make sure that the replacements are correct. For incorrect replacements, you can either add the word to the list of words not to replace (given as an argument when running the spell checker), or add the word to the word lists. You can choose to manually verify all of the replacements made or only a subset, depending on whether it is more important for you to replace potentially incorrect words or to avoid replacing unknown words that are in fact correct.
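The word-list handling in the first steps, and the frequency-ordered review of unknown words, can be sketched as below. This is an illustration under stated assumptions and not the repository's code: the function names are hypothetical, words are lower-cased, and paths in the index file are assumed to be relative to the folder of the index file. The real 'not_found' file may be produced and formatted differently.

```python
import tempfile
from collections import Counter
from pathlib import Path


def load_known_words(index_file):
    """Read an index file (one word-list path per row) and merge all lists."""
    known = set()
    index_path = Path(index_file)
    for row in index_path.read_text(encoding="utf-8").splitlines():
        row = row.strip()
        if not row:
            continue
        # Assumption: paths are relative to the folder of the index file.
        list_path = index_path.parent / row
        for word in list_path.read_text(encoding="utf-8").splitlines():
            if word.strip():
                known.add(word.strip().lower())
    return known


def unknown_by_frequency(tokens, known_words):
    """Return (word, count) pairs for unknown words, most frequent first."""
    counts = Counter(t.lower() for t in tokens if t.lower() not in known_words)
    return counts.most_common()


# Demo with throw-away files: one word list and one index file pointing to it.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / "swedish.txt").write_text("feber\noch\n", encoding="utf-8")
    (folder / "word-lists.txt").write_text("swedish.txt\n", encoding="utf-8")
    known_words = load_known_words(folder / "word-lists.txt")

tokens = ["feber", "febcr", "och", "febcr", "hosta"]
report = unknown_by_frequency(tokens, known_words)
print(report)
```

In a review like the one described in the steps above, a frequent unknown word that is a correct word ("hosta" here) would be added to a word list, while an OCR error ("febcr") would be left for the next correction run.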

