Latest commit: dariok, "for hi/hi, collect all attributes", c446eee · Oct 19, 2023 · History: 83 Commits
Name	Last commit message	Last commit date
.gitignore	blocks of lower level within other blocks	Aug 30, 2021
.project	Step 2:	Aug 29, 2019
LICENSE	Initial commit	Aug 29, 2019
README.md	update readme	Aug 13, 2023
build.xml	set name	Jul 21, 2023
pdf2tei.scenarios	fix typo in scenario file name	Nov 28, 2021
pdf2tei.xpr	house keeping	Feb 1, 2023
pt0-result.html	assume largest size as size of a line	Oct 19, 2023
pt0.xsl	assume largest size as size of a line	Oct 19, 2023
pt0.xspec	another try at finding lines	Aug 14, 2023
pt1-result.html	assume largest size as size of a line	Oct 19, 2023
pt1.xsl	assume largest size as size of a line	Oct 19, 2023
pt1.xspec	assume largest size as size of a line	Oct 19, 2023
pt2-result.html	improve adding spaces between runs	Jul 22, 2023
pt2.xsl	improve adding spaces between runs	Jul 22, 2023
pt2.xspec	improve adding spaces between runs	Jul 22, 2023
pt3-result.html	further improvements for div structure	Jul 22, 2023
pt3.xsl	added some documentation about styles	Jul 23, 2023
pt3.xspec	further improvements for div structure	Jul 22, 2023
pt4-result.html	further improve nesting and attributes for hi	Oct 6, 2023
pt4.xsl	further improvements for div structure	Jul 22, 2023
pt4.xspec	further improve nesting and attributes for hi	Oct 6, 2023
pt5-result.html	further improve nesting and attributes for hi	Oct 6, 2023
pt5.xsl	for hi/hi, collect all attributes	Oct 19, 2023
pt5.xspec	further improve nesting and attributes for hi	Oct 6, 2023

PDF2TEI

Basic conversion from PDF to TEI that tries to guess the structure of a text. Postprocessing required!

Usage

There are three basic ways to use this package. If you use oXygen, run the transformation scenario defined in the oXygen project (see "oXygen transformation scenario"). Alternatively, run the ANT task defined in build.xml (see "Using command line ANT") or, as a last option, run the steps manually (see "General workflow").

oXygen transformation scenario

A general scenario is defined in pdf2tei.xpr. You may need to adjust the parameters, especially 'saxon' which contains the path to a JAR of the Saxon XSLT processor (e.g. saxon-he-10.5.jar, as is used in the example).

You can use a Saxon JAR from the oXygen directories, but not one of the patched builds such as oxygen-patched-saxon-9.jar (or similar). Alternatively, download the latest version from Saxonica (https://www.saxonica.com/download/download_page.xml lists all available editions) or get the Home Edition directly from SourceForge (https://sourceforge.net/projects/saxon/files/Saxon-HE/10/Java/ for the Saxon 10 line).

Using command line ANT

With ant available on your path, you can call ant directly to run the predefined workflow in build.xml. Set the following parameters to the values for your situation:

name: the base name to be used for the resulting TEI file and the directory below outDir
outDir: path to the directory where the output is to be stored
pdf: path to the PDF file to be processed
saxon: path to a Saxon .jar (see the remarks in the previous section)

Example:

ant -Dname=pdftei -DoutDir=../output -Dpdf=../incoming/pdf-to-tei.pdf -Dsaxon=saxon-he-10.5.jar
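If the ANT call is to be scripted, for example to process several PDFs in a row, the command line can be assembled programmatically. The following is only a minimal sketch: the four parameter names are taken from the list above, while the function name `build_ant_command` and the check for missing values are assumptions made for illustration and are not part of this package.

```python
import shlex

# Parameter names as documented above; the rest of this module is illustrative.
REQUIRED = ("name", "outDir", "pdf", "saxon")


def build_ant_command(params):
    """Return the ant invocation as an argument list (hypothetical helper)."""
    missing = [key for key in REQUIRED if not params.get(key)]
    if missing:
        raise ValueError("missing parameter(s): " + ", ".join(missing))
    # Each parameter is passed to ant as a -Dkey=value property.
    return ["ant"] + ["-D{}={}".format(key, params[key]) for key in REQUIRED]


if __name__ == "__main__":
    cmd = build_ant_command({
        "name": "pdftei",
        "outDir": "../output",
        "pdf": "../incoming/pdf-to-tei.pdf",
        "saxon": "saxon-he-10.5.jar",
    })
    # Print the command for inspection instead of running it.
    print(shlex.join(cmd))
```

To actually run the build, the returned list could be handed to `subprocess.run(cmd, check=True)` from the directory that contains build.xml.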

General workflow

Use pdftohtml -xml file.pdf to create a basic XML file.
Apply pt1.xsl to pt4.xsl sequentially, each to the output of the previous step.
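The two manual steps above can be sketched as a small script that only assembles the commands. This is an illustration under several assumptions, not the workflow from build.xml: it assumes that pdftohtml writes its XML next to the PDF under the same base name, that Saxon is called through its standard command line (`-s:`, `-xsl:`, `-o:`), and that intermediate files are named `<base>-pt1.xml` and so on. The function name `build_workflow` and the intermediate file names are hypothetical.

```python
from pathlib import Path


def build_workflow(pdf, saxon_jar,
                   stylesheets=("pt1.xsl", "pt2.xsl", "pt3.xsl", "pt4.xsl")):
    """Return the list of commands for the manual workflow (hypothetical helper)."""
    pdf = Path(pdf)
    # Assumption: pdftohtml -xml file.pdf produces file.xml beside the PDF.
    current = pdf.with_suffix(".xml")
    commands = [["pdftohtml", "-xml", str(pdf)]]
    for xsl in stylesheets:
        step = Path(xsl).stem
        # Assumed naming scheme for intermediate results, e.g. file-pt1.xml.
        target = pdf.with_name("{}-{}.xml".format(pdf.stem, step))
        commands.append([
            "java", "-jar", str(saxon_jar),
            "-s:{}".format(current),
            "-xsl:{}".format(xsl),
            "-o:{}".format(target),
        ])
        # The output of this step is the input of the next one.
        current = target
    return commands


if __name__ == "__main__":
    for command in build_workflow("file.pdf", "saxon-he-10.5.jar"):
        print(" ".join(command))
```

Each entry can be executed with `subprocess.run(command, check=True)`, which stops the chain as soon as one step fails, so a later stylesheet never runs on stale input.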

Limitations

While these scripts try their best to guess a structure (headings, paragraphs) from the PDF, there are major limitations to this approach. Hence, the output is not valid TEI and must be postprocessed. We cannot, for instance, determine for certain whether a passage in smaller type is a footnote or a quotation without knowledge of the contents. Also, we can only assume that a page has at most one line of header and one line of footer. Pages with more than that will result in a wrong structure and possibly a column break.

To facilitate postprocessing, values that were calculated during the transformation are retained in the result. This means that the dimensional attributes @left, @top, @size, @bottom, and @right are present on every line, and @height, @width, and @l (the most frequently used @left of all lines) are present on pb. Additionally, every tei:l consists of one or more tei:hi with layout information (most importantly @rendition, but also dimensional attributes).
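As an illustration of how the retained attributes could be used in postprocessing, the sketch below lists lines whose @size is smaller than the most frequent size, as candidates for footnotes or quotations that need manual review. The sample document, its attribute values, and the function name `smaller_lines` are made up for this example; real output of the stylesheets may differ in structure and value format.

```python
import xml.etree.ElementTree as ET
from collections import Counter

TEI = "{http://www.tei-c.org/ns/1.0}"

# Hypothetical sample; attribute values are invented for illustration.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <l left="90" top="100" size="12"><hi rendition="#f1">Body text line</hi></l>
  <l left="90" top="116" size="12"><hi rendition="#f1">Another body line</hi></l>
  <l left="90" top="700" size="9"><hi rendition="#f2">1 Possibly a footnote</hi></l>
</TEI>"""


def smaller_lines(root):
    """Return the text of all tei:l whose @size is below the most common size."""
    lines = [l for l in root.iter(TEI + "l") if l.get("size")]
    sizes = Counter(float(l.get("size")) for l in lines)
    if not sizes:
        return []
    # Treat the most frequent size as the size of the running text.
    body_size = sizes.most_common(1)[0][0]
    return [
        "".join(l.itertext()).strip()
        for l in lines
        if float(l.get("size")) < body_size
    ]


if __name__ == "__main__":
    for text in smaller_lines(ET.fromstring(SAMPLE)):
        print(text)
```

Whether such a line is a footnote or a quotation still has to be decided by a person, as described under Limitations. The script only narrows down where to look.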

Some contributions to this software were created within the scope of a project funded by the German BMBF, project ID 16TOA015A.
