Skip to content

dariok/pdf2tei

Folders and files

NameName
Last commit message
Last commit date

Latest commit

c446eee · Oct 19, 2023

History

83 Commits
Aug 30, 2021
Aug 29, 2019
Aug 29, 2019
Aug 13, 2023
Jul 21, 2023
Nov 28, 2021
Feb 1, 2023
Oct 19, 2023
Oct 19, 2023
Aug 14, 2023
Oct 19, 2023
Oct 19, 2023
Oct 19, 2023
Jul 22, 2023
Jul 22, 2023
Jul 22, 2023
Jul 22, 2023
Jul 23, 2023
Jul 22, 2023
Oct 6, 2023
Jul 22, 2023
Oct 6, 2023
Oct 6, 2023
Oct 19, 2023
Oct 6, 2023

Repository files navigation

PDF2TEI

basic conversion from PDF to TEI trying to guess the structure of a text. Postprocessing required!

Usage

There are three basic ways to use this package. If you use oXygen, you can use the transformation scenario defined in the oXygen project (see below). Alternatively, you can use the ANT task defined in build.xml (see further below) or as a last option, do it manually.

oXygen transformation scenario

A general scenario is defined in pdf2tei.xpr. You may need to adjust the parameters, especially 'saxon' which contains the path to a JAR of the Saxon XSLT processor (e.g. saxon-he-10.5.jar, as is used in the example).

You can use a jar from the oXygen directories but not one of the oxygen-patched-saxon-9.jar (or similar). Alternatively, you can get the latest version from Saxonica ([https://www.saxonica.com/download/download_page.xml] for a complete selection of the available editions) or the current version of the Home Edition directly from sourceforge ([https://sourceforge.net/projects/saxon/files/Saxon-HE/10/Java/] for the current line of Saxon 10).

Using command line ANT

With ant available on your path, you can directly call ant to run the predefined workflow in build.xml. You need to set the parameters to the values for you situation:

  • name: the base name to be used for the resulting TEI file and the directory below outDir
  • outDir: path to the directory where the output is to be stored
  • pdf: path to the PDF file to be processed
  • saxon: path to a Saxon .jar (see the remarks in the previous section)

Example:

ant -Dname=pdftei -DoutDir=../output -Dpdf=../incoming/pdf-to-tei.pdf -Dsaxon=saxon-he-10.5.jar

General workflow

  1. use pdftohtml -xml file.pdf to create a basic XML
  2. apply pt1.xsl to pt4.xsl sequentially

Limitations

While these scripts try their best to guess a structure – headings, paragraphs – from the PDF, there are major limitations to this approach. Hence, the output is not valid TEI but must be postprocessed. We cannot, for instance, determine for certain whether a smaller passage is a footnote or a quotation without knowledge of the contents. Also, we can only assume that a page has a maximum of one line of heading and footer each. Pages with more than that will result in a wrong structure and possibly a column break.

To facilitate the postprocessing, values that were calculated during transformation were retained in the result. This means that there are the dimensional attributes @left, @top, @size, @bottom, and @right present for every line, and @height, @width, and @l (for the most frequently used @left of all lines) on pb. Additionally, all tei:l are comprised of one or more tei:hi with layout information (most importantly @rendition but also dimensional attributes).


Some contributions to this software were created within the scope of a project funded by the German BMBF, project ID 16TOA015A.

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published