Document builder #11

Status: Open. Wants to merge 17 commits into base: master.

@stippi stippi commented May 19, 2016

The most important features of the „document builder“ framework are working.

According to the Word2007 RTF specs page 18 for control word \ucN, the
default is to skip one byte after a Unicode character. I was reading
weird ‚?‘ characters after German umlauts until I understood what the
problem was. This RTF does not contain the \ucN control word anywhere,
but relies on the default.
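
To illustrate the \ucN default described above, here is a minimal sketch of a decoder that drops the fallback character after each Unicode character. The class `UcSkipSketch` and its `decode` method are hypothetical and not part of this pull request; the sketch assumes well-formed input and ignores group scoping and all other control words.

```java
final class UcSkipSketch {
    /**
     * Extracts the text of a tiny RTF fragment. Only plain characters,
     * hex escapes and the "uc" and "u" control words are interpreted.
     */
    static String decode(String rtf) {
        StringBuilder out = new StringBuilder();
        int skipCount = 1;   // spec default when no "uc" control word was seen
        int pendingSkip = 0; // fallback characters that still have to be dropped
        int i = 0;
        while (i < rtf.length()) {
            char c = rtf.charAt(i);
            if (c != '\\') {
                // Plain character: either a fallback to drop, or real text.
                if (pendingSkip > 0) pendingSkip--; else out.append(c);
                i++;
                continue;
            }
            if (i + 1 < rtf.length() && rtf.charAt(i + 1) == '\'') {
                // Hex escape (backslash, quote, two hex digits) counts as one character.
                int value = Integer.parseInt(rtf.substring(i + 2, i + 4), 16);
                if (pendingSkip > 0) pendingSkip--; else out.append((char) value);
                i += 4;
                continue;
            }
            // Control word: letters, then an optional signed number.
            int j = i + 1;
            while (j < rtf.length() && Character.isLetter(rtf.charAt(j))) j++;
            String word = rtf.substring(i + 1, j);
            int k = j;
            if (k < rtf.length() && rtf.charAt(k) == '-') k++;
            while (k < rtf.length() && Character.isDigit(rtf.charAt(k))) k++;
            boolean hasParam = k > j && Character.isDigit(rtf.charAt(k - 1));
            int param = hasParam ? Integer.parseInt(rtf.substring(j, k)) : 0;
            if (!hasParam) k = j;
            if (k < rtf.length() && rtf.charAt(k) == ' ') k++; // delimiter space
            i = Math.max(k, i + 1);

            if (word.equals("uc") && hasParam) {
                skipCount = param;
            } else if (word.equals("u") && hasParam) {
                out.append((char) param); // negative values wrap to 16 bits
                pendingSkip = skipCount;
            }
        }
        return out.toString();
    }
}
```

With the default of one, `decode("M\\u252?ller")` drops the `?` that follows the Unicode character; a decoder that does not apply the default would keep it in the text, which matches the symptom described above.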

There is a package „document“ which defines a bunch of basic interfaces
which represent aspects of an RTF document. DocumentPart refers to an
object that can contain paragraphs of styled text. There are currently
three areas of the document supported: The Header, Footer and the
document itself. The document is divided into Sections, which is a
concept supported by RTF and many text editing applications. Global
formatting properties can change at section breaks. Furthermore, the
Document contains a ColorTable, FontTable, StyleSheet and
DocumentSettings. The idea is that code that works with the document
model gets an instance of a Document and can get instances of all the
other interfaces from the document instance.

There is an implementation of all interfaces in the package
document.impl. The classes sometimes offer additional functionality,
but in the last iterations of changes, I have eliminated the need for
that by moving functionality into the interfaces.

In the package parser.builder, there is a new RtfListener
implementation called DocumentBuilder. The DocumentBuilder is created
with an instance of Document. Then it is passed to the
StandardParser.parse() method.

The idea behind DocumentBuilder is that it maintains a stack of
RtfContext instances and delegates all events to the currently active
context. RtfContext has an interface almost exactly like RtfListener,
except that there is an additional overload of processGroupStart()
which takes the same parameters as processCommand(). The reason is that
some RTF groups start with a specific command and denote an RTF
„destination“. DocumentBuilder therefore has the notion of a „delayed
group start“. In processGroupStart() it only marks that it is currently
at a group start, but does nothing further yet. In processCommand() it
tries to form a command-based group start when it encounters a
destination command. Otherwise it just processes the delayed group
start before processing any other events.
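
The „delayed group start“ can be sketched roughly as follows. This is a simplified illustration, not the code in this pull request: the `Context` interface, the `Builder` class, the string-based commands and the small destination set are hypothetical stand-ins, and contexts pushing or popping other contexts is left out.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class DelayedGroupStartSketch {
    /** Hypothetical stand-in for RtfContext; commands are plain strings here. */
    interface Context {
        void processGroupStart();                   // plain group
        void processGroupStart(String destination); // group opened by a destination command
        void processCommand(String command);
        void processString(String text);
        void processGroupEnd();
    }

    /** Records every event it receives, so the forwarding order can be inspected. */
    static final class RecordingContext implements Context {
        final List<String> log = new ArrayList<>();
        public void processGroupStart() { log.add("groupStart"); }
        public void processGroupStart(String destination) { log.add("groupStart(" + destination + ")"); }
        public void processCommand(String command) { log.add("command(" + command + ")"); }
        public void processString(String text) { log.add("string(" + text + ")"); }
        public void processGroupEnd() { log.add("groupEnd"); }
    }

    // Assumption: a tiny illustrative subset of RTF destination commands.
    static final Set<String> DESTINATIONS =
            new HashSet<>(Arrays.asList("fonttbl", "colortbl", "stylesheet"));

    static final class Builder {
        private final Deque<Context> stack = new ArrayDeque<>();
        private boolean atGroupStart; // a group start was seen but not yet forwarded

        Builder(Context root) { stack.push(root); }

        void processGroupStart() {
            flushDelayedGroupStart(); // an earlier pending start was a plain group
            atGroupStart = true;      // only remember it, forward nothing yet
        }

        void processCommand(String command) {
            if (atGroupStart && DESTINATIONS.contains(command)) {
                // Group start and destination command become one event.
                atGroupStart = false;
                stack.peek().processGroupStart(command);
                return;
            }
            flushDelayedGroupStart();
            stack.peek().processCommand(command);
        }

        void processString(String text) {
            flushDelayedGroupStart();
            stack.peek().processString(text);
        }

        void processGroupEnd() {
            flushDelayedGroupStart();
            stack.peek().processGroupEnd();
        }

        private void flushDelayedGroupStart() {
            if (atGroupStart) {
                atGroupStart = false;
                stack.peek().processGroupStart();
            }
        }
    }
}
```

Feeding the events for a font table group into this `Builder` forwards a single combined group start for `fonttbl`, while the inner font group arrives as a plain group start followed by its command.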

Various RtfContext implementations exist which can already process
events to build many parts of the document model.

Fixed the copyright of all added files to also be the APL 2.0.

Introduced support for parsing annotations and storing them in the
document model. The objects contained in a Paragraph are no longer
Chunks, but Elements. Chunk inherits from Element, as does Annotation,
which also inherits from DocumentPart. That means an annotation can
contain any styled text, including text which itself contains
annotations.
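
The relationship between Element, Chunk, Annotation and DocumentPart might look roughly like the sketch below. The interface names follow the description above, but every method name and the helper classes `Paragraph`, `TextChunk` and `Note` are assumptions made for illustration; styles are left out entirely.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ElementModelSketch {
    interface Element {}
    interface Chunk extends Element { String text(); }
    interface DocumentPart { List<Paragraph> paragraphs(); }
    /** An annotation is an element of a paragraph and a container of paragraphs. */
    interface Annotation extends Element, DocumentPart {}

    static final class Paragraph {
        final List<Element> elements = new ArrayList<>();
        Paragraph add(Element e) { elements.add(e); return this; }
    }

    static final class TextChunk implements Chunk {
        private final String text;
        TextChunk(String text) { this.text = text; }
        public String text() { return text; }
    }

    static final class Note implements Annotation {
        private final List<Paragraph> paragraphs = new ArrayList<>();
        Note(Paragraph... content) { paragraphs.addAll(Arrays.asList(content)); }
        public List<Paragraph> paragraphs() { return paragraphs; }
    }

    /** Concatenates the chunks of a paragraph and leaves annotations out. */
    static String mainText(Paragraph p) {
        StringBuilder sb = new StringBuilder();
        for (Element e : p.elements) {
            if (e instanceof Chunk) sb.append(((Chunk) e).text());
        }
        return sb.toString();
    }

    /** Counts annotations, including annotations nested inside annotations. */
    static int countAnnotations(Paragraph p) {
        int count = 0;
        for (Element e : p.elements) {
            if (e instanceof Annotation) {
                count++;
                for (Paragraph inner : ((Annotation) e).paragraphs()) {
                    count += countAnnotations(inner);
                }
            }
        }
        return count;
    }
}
```

Because an annotation is both an element and a container of paragraphs, the recursion in `countAnnotations` falls out naturally, and code that only wants the main text can skip every element that is not a Chunk.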

Now mentions the existence of the new DocumentBuilder parser and that
it is probably the parser to use.

Style is now split up into a hierarchy of interfaces. The base
interface Style defines all properties; a Style can have a name and
can return the properties it overrides.
The first derived interface is CharacterStyle and it defines setters
and getters for just the character style related properties.
The second derived interface is ParagraphStyle, which also inherits
from CharacterStyle. It has an additional method for creating a
derived CharacterStyle that has the ParagraphStyle as its parent.
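
A rough sketch of such a style hierarchy is shown below. The three interface names come from the description above; the individual properties, the method names and the two implementation classes are hypothetical, and the lookup through the parent is only one plausible way to resolve properties that a style does not override.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

final class StyleHierarchySketch {
    interface Style {
        String getName();
        Style getParent();
        Set<String> getOverriddenProperties();
    }

    interface CharacterStyle extends Style {
        boolean isBold();
        void setBold(boolean bold);
        int getFontSize();
        void setFontSize(int halfPoints);
    }

    interface ParagraphStyle extends CharacterStyle {
        int getSpacingBefore();
        void setSpacingBefore(int twips);
        /** Creates a character style whose parent is this paragraph style. */
        CharacterStyle createDerivedStyle();
    }

    static class SimpleCharacterStyle implements CharacterStyle {
        private final String name;
        private final SimpleCharacterStyle parent;
        private final Map<String, Object> overrides = new LinkedHashMap<>();

        SimpleCharacterStyle(String name, SimpleCharacterStyle parent) {
            this.name = name;
            this.parent = parent;
        }

        public String getName() { return name; }
        public Style getParent() { return parent; }
        public Set<String> getOverriddenProperties() {
            return Collections.unmodifiableSet(overrides.keySet());
        }

        public boolean isBold() { return (Boolean) lookup("bold", Boolean.FALSE); }
        public void setBold(boolean bold) { override("bold", bold); }
        public int getFontSize() { return (Integer) lookup("fontSize", 24); }
        public void setFontSize(int halfPoints) { override("fontSize", halfPoints); }

        /** Own override first, then the parent chain, then the given fallback. */
        Object lookup(String key, Object fallback) {
            if (overrides.containsKey(key)) return overrides.get(key);
            return parent != null ? parent.lookup(key, fallback) : fallback;
        }

        void override(String key, Object value) { overrides.put(key, value); }
    }

    static final class SimpleParagraphStyle extends SimpleCharacterStyle implements ParagraphStyle {
        SimpleParagraphStyle(String name, SimpleCharacterStyle parent) { super(name, parent); }

        public int getSpacingBefore() { return (Integer) lookup("spacingBefore", 0); }
        public void setSpacingBefore(int twips) { override("spacingBefore", twips); }
        public CharacterStyle createDerivedStyle() { return new SimpleCharacterStyle(null, this); }
    }
}
```

In this sketch a derived character style that only sets bold reports just that one overridden property and takes its font size from the paragraph style it was derived from.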

StyleSheet no longer directly holds styles. Rather, it defines getters
for instances of CharacterStyleTable and ParagraphStyleTable. The 2007
spec also defines table and section style tables, which are not yet
implemented.

Work in DocumentBuilder towards parsing the styles in the style sheet
section. Currently it only creates the styles, sets their names and
puts them into the correct table with their correct ID.

Also new is ignoring unknown groups. These are detected by a group
start directly followed by a Command.optionalcommand, i.e. a command
not in the Command table. The complete group is ignored by the
DocumentBuilder, which fixes some unwanted text ending up in the
document. LibreOffice apparently uses a newer version of the spec with
some section descriptions that we don’t understand; that is no longer
a problem.
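
Skipping a complete unknown group, including any groups nested inside it, can be done by remembering the nesting depth at which skipping started. The sketch below assumes a simplified event list of strings in which `cmd:*` stands in for Command.optionalcommand; the class and method names are hypothetical and the actual DocumentBuilder may do this differently.

```java
import java.util.List;

final class SkipUnknownGroupSketch {
    /**
     * Events are "{", "}", "cmd:NAME" or "text:CONTENT".
     * Returns the text that is not inside a skipped group.
     */
    static String collectText(List<String> events) {
        StringBuilder text = new StringBuilder();
        int depth = 0;
        int skipDepth = -1;           // depth of the group being skipped, -1 if none
        boolean atGroupStart = false; // true directly after "{"
        for (String event : events) {
            if (event.equals("{")) {
                depth++;
                atGroupStart = true;
                continue;
            }
            boolean firstInGroup = atGroupStart;
            atGroupStart = false;
            if (event.equals("}")) {
                // Leaving the group that started the skipping ends it.
                if (skipDepth == depth) skipDepth = -1;
                depth--;
                continue;
            }
            if (skipDepth >= 0) {
                continue; // inside an ignored group: drop commands and text
            }
            if (firstInGroup && event.equals("cmd:*")) {
                skipDepth = depth; // ignore everything up to the matching "}"
                continue;
            }
            if (event.startsWith("text:")) {
                text.append(event.substring("text:".length()));
            }
        }
        return text.toString();
    }
}
```

Tracking the depth rather than a simple flag matters because the ignored group may itself contain groups, and only the closing brace at the original depth ends the skipping.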

My name contains a German umlaut, and I guess it is helpful in general
to have a defined encoding.

… to conform with the rest of the project setup and eliminate the
warnings.

A Chunk in a Paragraph just needs to know a CharacterStyle. Changed the
places in the API where a ParagraphStyle was passed but a
CharacterStyle should be used instead.

Allow getting the parent of a Style and making a Style an exact copy
of another Style via setTo(). This is supposed to make it easier to use
a specific style from the style sheet once parsing that is implemented.

Fixing some warnings had introduced a bug: FontContext was not handling
\f and called the super class in the default case, which throws an
exception on „unexpected“ commands.
Also changed the FontTable interface to store and look up fonts by id
rather than by index.
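
Looking up fonts by id rather than by index matters because the numbers used with \f in a font table do not have to start at zero or be contiguous. A minimal sketch, with hypothetical class and method names that are not taken from this pull request:

```java
import java.util.HashMap;
import java.util.Map;

final class FontTableSketch {
    // Keyed by the number that follows the font command, not by insertion order.
    private final Map<Integer, String> fontNamesById = new HashMap<>();

    void addFont(int id, String name) {
        fontNamesById.put(id, name);
    }

    /** Returns the font name for the id, or null if the id was never defined. */
    String fontFor(int id) {
        return fontNamesById.get(id);
    }

    int countFonts() {
        return fontNamesById.size();
    }
}
```

With an index-based list, a table that defines only fonts 0 and 37 would resolve a later reference to font 37 to the wrong entry or fail; the map keeps the two ids apart.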

Parameter evaluation was reversed. Extended unit test to check parsing
of the second paragraph of the test document.

... instead of directing the text to the document destination. Fixes,
for example, bookmark titles arriving in the text.