I tried lapdftext and created a very generic configuration file so that I could
parse many different PDF styles.
The images produced for the sections look really good: the header, headings,
figures, tables, and body are identified correctly. But regardless of the
method I use to produce text, some parts are missing.
In fullText.txt I get most of the content marked as body and captions;
in openAccess.xml I get the abstract and references. What is completely missing
from both files is the title.
I attached one example containing a PDF and all the output I got.
I downloaded lapdftext_unix_1_7_2-SNAPSHOT.tar.gz, extracted it, and used the
shell scripts provided. When running the scripts I occasionally get the error
[Fatal Error] :1:14: The element type "root" must be terminated by the matching end-tag "</root>".
Is that the cause of the missing text, or is there a problem with my
configuration file?
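The error message indicates that some XML input was not well-formed, so one way to narrow this down would be to check the XML files involved outside lapdftext. The following is only a minimal sketch in Python, assuming the file contents are available as a string; `check_well_formed` is a hypothetical helper and not part of lapdftext, and Python's parser words its errors differently from the message above.

```python
import xml.etree.ElementTree as ET


def check_well_formed(xml_text):
    """Return None if xml_text parses as XML, else the parser's error message."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        # The message includes the line and column where parsing failed.
        return str(err)
    return None


# A complete document parses; a truncated one (root never closed) does not.
print(check_well_formed("<root><a/></root>"))
print(check_well_formed("<root><a/>"))
```

Running this over the configuration file and the generated XML output would show whether any of them is truncated or malformed, which may help separate a configuration problem from a problem in the tool itself.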
Original issue reported on code.google.com by [email protected] on 5 Nov 2014 at 1:08