Releases: weblyzard/inscriptis
Integrated feedback obtained through the Journal of Open Source Software review process
- improved documentation based on feedback provided by @reality, @rlskoeser and @sbenthall as part of the Journal of Open Source Software review process.
- the Inscriptis web service has been included in the Python package and can now be started with
export FLASK_APP="inscriptis.service.web"
python3 -m flask run
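Once the service is running, a client can send it an HTML document and read back the text. The following is a minimal sketch using only the Python standard library. It assumes the service listens on Flask's default address (http://localhost:5000) and exposes a `/get_text` endpoint that accepts the HTML in the request body; both the endpoint name and the helper names `build_get_text_request` and `html_to_text` are assumptions made for illustration, not details taken from these release notes.

```python
import urllib.request


def build_get_text_request(
    html: str, base_url: str = "http://localhost:5000"
) -> urllib.request.Request:
    """Build a POST request that carries the HTML document in its body."""
    return urllib.request.Request(
        url=f"{base_url}/get_text",  # assumed endpoint name
        data=html.encode("utf-8"),
        headers={"Content-Type": "text/html; encoding=UTF8"},
        method="POST",
    )


def html_to_text(html: str, base_url: str = "http://localhost:5000") -> str:
    """Send the document to a running service and return the response body."""
    request = build_get_text_request(html, base_url)
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

Calling `html_to_text("<h1>Title</h1><p>Body</p>")` requires the web service to be running; building the request does not.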
Improved document model, parsing of borderline cases & HTML annotation support
Changes
HTML parsing:
- new: improved model for handling text blocks and lines
- chg: improved HTML parsing of tables, enumerations and margins; fixed borderline cases
- chg: improved whitespace handling
- add: cover more borderline cases with unit tests
Inscriptis core:
- new: annotation support
- new: processing of annotation rules and annotation output
- new: type hints
- add: extended and improved documentation
Inscript command line client:
- new: added --annotation-rules option for annotation support.
- new: added --post-processor option to export and visualize annotations (HTML, XML and surface form export)
- chg: apply --encoding to Web URLs as well
Misc:
- chg: migrated to the semantic versioning schema described on https://semver.org/ for versioning.
Note
In terms of functionality, this release corresponds to Inscriptis 2.0rc2.
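To make the annotation support more concrete, here is a hedged sketch of how annotation rules might be defined and used. It assumes that rules map HTML tag names to lists of annotation labels, that the rules can be stored as a JSON file for the command line client, and that the package exposes `get_annotated_text` together with a `ParserConfig(annotation_rules=...)` parameter. The rule set and the helper names `write_rules` and `annotate` are hypothetical and only serve as an illustration.

```python
import json
from pathlib import Path

# Hypothetical rule set: HTML tag name -> list of annotation labels.
ANNOTATION_RULES = {
    "h1": ["heading"],
    "h2": ["heading"],
    "b": ["emphasis"],
    "table": ["table"],
}


def write_rules(path: Path) -> Path:
    """Store the rules as JSON, e.g. for use with --annotation-rules."""
    path.write_text(json.dumps(ANNOTATION_RULES, indent=2), encoding="utf-8")
    return path


def annotate(html: str) -> dict:
    """Convert HTML to text plus annotations (requires inscriptis >= 2.0)."""
    # Imported lazily so that the rule handling above works without inscriptis.
    from inscriptis import get_annotated_text
    from inscriptis.model.config import ParserConfig

    return get_annotated_text(html, ParserConfig(annotation_rules=ANNOTATION_RULES))
```

Under these assumptions, `annotate("<h1>Title</h1><p>Some <b>bold</b> text</p>")` would return the converted text together with the labelled spans, and the file produced by `write_rules` could be handed to the --annotation-rules option.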
Fixed annotations for borderline cases
Please refer to https://github.com/weblyzard/inscriptis/releases/tag/2.0rc1 for a list of all new features. This release candidate fixes the following issues in rc1:
- fixed annotations for some borderline cases
- improved documentation compared to 2.0rc1
Improved document model, parsing of borderline cases & HTML annotation support
HTML parsing:
- new: new model for handling blocks and lines
- chg: improved HTML parsing of tables, enumerations and margins; fixed borderline cases
- chg: improved whitespace handling
- add: cover more borderline cases with unit tests
Inscriptis core:
- new: support for annotation rules and annotation output
- new: annotation post-processors (html, xml, surface form)
- new: type hints
- chg: extended and improved documentation
Inscript command line client:
- chg: apply --encoding to Web URLs as well
- chg: apply
1.2
Improved margin handling & more liberal licensing
- ignore top margins at the beginning of a document.
- more liberal licensing:
- the license change has been triggered by another project that created a Java port of inscriptis.
- to facilitate the free sharing of code and ideas between our two projects, we have (i) obtained the permission of all contributors for a license change, and (ii) changed the inscriptis license to the "Apache License 2.0".
Improved testing and Python 3.9 support
- minor performance improvements and code optimizations
- added Python 3.9 test environment
- improved test coverage
- updated package metadata
- improved tox configuration
Improved HTML rendering, command line client and Web service
- added support for rendering tags with the white-space: pre CSS property (e.g. <pre>, which is often used for formatting code).
- API change: A ParserConfig object replaces the parameters display_images, deduplicate_captions, display_links and indentation in get_text() and for initializing the Inscriptis class.
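from lxml.html import fromstring
from inscriptis.html_engine import Inscriptis
from inscriptis.model.config import ParserConfig

# example input; replace with the HTML document to convert
html = '<p>Visit <a href="https://example.org">this page</a>.</p>'
html_tree = fromstring(html)
# optional parser configuration fine tuning
config = ParserConfig(display_links=True, display_anchors=True)
# convert the parsed HTML tree to text
parser = Inscriptis(html_tree, config)
text = parser.get_text()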
- command line client:
- added option for displaying anchor links
- --encoding now sets the HTML and output encoding
- new --version option
- Web service
- use the relaxed CSS profile per default
- added version call
- Documentation fixes and improvements
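The effect of the white-space: pre support mentioned above can be illustrated with a small, simplified sketch. It is not the Inscriptis implementation; the function name `render_text` and its `white_space` parameter are hypothetical. The sketch only contrasts the default behaviour, where runs of whitespace collapse into a single space, with pre-formatted content, where line breaks and indentation are kept as they are.

```python
import re


def render_text(text: str, white_space: str = "normal") -> str:
    """Render text content according to a (simplified) white-space setting."""
    if white_space == "pre":
        # Pre-formatted content: keep line breaks and indentation untouched.
        return text
    # Default: collapse any run of whitespace (including newlines) to one space.
    return re.sub(r"\s+", " ", text).strip()


snippet = "def f():\n    return 1"
collapsed = render_text(snippet)          # "def f(): return 1"
preserved = render_text(snippet, "pre")   # identical to snippet
```

Collapsing the whitespace of the code snippet makes it unreadable, which is why content such as <pre> blocks needs the preserving behaviour.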
Improved performance and code structure, documentation and unit testing
- improved performance and code structure.
- use metadata published in ./inscriptis/__init__.py for versioning and in setup.py.
- improved test coverage
- created Sphinx API, usage and testing documentation, which is published at https://inscriptis.readthedocs.org
- requires Python 3.5+ (dropped support for Python 2.7)
Correct inscript.py default indentation strategy.
Use the extended indentation strategy per default as outlined in the README.md.