Document builder #11
Open
stippi wants to merge 17 commits into joniles:master from stippi:document-builder
According to page 18 of the Word 2007 RTF specification, the default for the control word \ucN is to skip one byte after a Unicode character. I was reading stray '?' characters after German umlauts until I understood the problem: this RTF does not contain the \ucN control word anywhere, but relies on the default.
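The skipping rule can be illustrated with a small, self-contained sketch. It is not the parser code from this pull request: the class name UcSkipSketch and its decode() method are made up for illustration, it only understands \ucN, \uN, \'hh and plain characters, and it treats \'hh as Latin-1, which is an assumption.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the project: decodes a tiny subset of RTF text.
final class UcSkipSketch {
    // Alternatives: the uc control word, a Unicode character control word
    // (the number may be negative), a hex escape, or any other single character.
    private static final Pattern TOKEN = Pattern.compile(
        "\\\\uc(\\d+) ?|\\\\u(-?\\d+) ?|\\\\'([0-9a-fA-F]{2})|(.)", Pattern.DOTALL);

    static String decode(String rtf) {
        StringBuilder out = new StringBuilder();
        int ucSkip = 1;      // default skip count when no uc control word was seen
        int pendingSkip = 0; // fallback units that still have to be dropped
        Matcher m = TOKEN.matcher(rtf);
        while (m.find()) {
            if (m.group(1) != null) {
                ucSkip = Integer.parseInt(m.group(1));
            } else if (m.group(2) != null) {
                int code = Integer.parseInt(m.group(2));
                if (code < 0) {
                    code += 65536; // RTF writes values above 32767 as negative numbers
                }
                out.append((char) code);
                pendingSkip = ucSkip;
            } else if (pendingSkip > 0) {
                pendingSkip--; // a plain character or a hex escape counts as one unit
            } else if (m.group(3) != null) {
                out.append((char) Integer.parseInt(m.group(3), 16)); // assumption: Latin-1
            } else {
                out.append(m.group(4));
            }
        }
        return out.toString();
    }
}
```

With the default skip count of one, the '?' fallback after each Unicode character is dropped; after an explicit \uc0 nothing is skipped.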
The package „document“ defines a set of basic interfaces that represent aspects of an RTF document. DocumentPart refers to an object that can contain paragraphs of styled text. Three areas of the document are currently supported: the Header, the Footer and the document itself. The document is divided into Sections, a concept supported by RTF and many text editing applications; global formatting properties can change at section breaks. Furthermore, the Document contains a ColorTable, FontTable, StyleSheet and DocumentSettings. The idea is that code working with the document model gets an instance of Document and obtains instances of all the other interfaces from it.
The package document.impl contains an implementation of all interfaces. The classes sometimes offer additional functionality, but in the last iterations of changes I have eliminated the need for that by moving functionality into the interfaces.
The package parser.builder contains a new RtfListener implementation called DocumentBuilder. A DocumentBuilder is created with an instance of Document and is then passed to the StandardParser.parse() method. It maintains a stack of RtfContext instances and delegates all events to the currently active context.
RtfContext has an interface almost exactly like RtfListener, except that there is an additional version of processGroupStart() which takes the same parameters as processCommand(). The reason is that some RTF groups start with a specific command and denote an RTF „destination“. DocumentBuilder therefore has the notion of a „delayed group start“: in processGroupStart() it only marks that it is currently at a group start and does nothing further yet. In processCommand() it tries to form a command-based group start when it encounters a destination command. Otherwise it processes the delayed group start before processing any other event.
Various RtfContext implementations exist which can already process events to build many parts of the document model.
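The delayed group start is easiest to see in a stripped-down sketch. Everything here is a simplified stand-in and not the code in this pull request: SketchContext, RecordingContext and SketchBuilder are hypothetical names, commands are plain strings without parameters, the destination list is a made-up subset, and letting processGroupStart() return the context for the new group is an assumption about how the stack could be maintained.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical, simplified stand-in for the RtfContext idea. */
interface SketchContext {
    SketchContext processGroupStart();               // plain group
    SketchContext processGroupStart(String command); // group opened by a destination command
    void processCommand(String command);
    void processString(String text);
}

/** Records each event it receives, prefixed with its own name. */
final class RecordingContext implements SketchContext {
    private final String name;
    private final List<String> log;

    RecordingContext(String name, List<String> log) {
        this.name = name;
        this.log = log;
    }

    public SketchContext processGroupStart() {
        log.add(name + ":group");
        return this; // a plain group stays in the same context
    }

    public SketchContext processGroupStart(String command) {
        log.add(name + ":destination " + command);
        return new RecordingContext(command, log); // a destination gets its own context
    }

    public void processCommand(String command) { log.add(name + ":command " + command); }

    public void processString(String text) { log.add(name + ":text " + text); }
}

final class SketchBuilder {
    // Assumption: a small, illustrative subset of destination commands.
    private static final Set<String> DESTINATIONS =
        new HashSet<>(Arrays.asList("fonttbl", "colortbl", "stylesheet"));

    private final Deque<SketchContext> stack = new ArrayDeque<>();
    private boolean atGroupStart;

    SketchBuilder(SketchContext root) { stack.push(root); }

    void processGroupStart() {
        flushDelayedGroupStart(); // a group directly inside another group start
        atGroupStart = true;      // only remember it, do nothing further yet
    }

    void processCommand(String command) {
        if (atGroupStart && DESTINATIONS.contains(command)) {
            atGroupStart = false; // form the command-based group start
            stack.push(stack.peek().processGroupStart(command));
        } else {
            flushDelayedGroupStart();
            stack.peek().processCommand(command);
        }
    }

    void processString(String text) {
        flushDelayedGroupStart();
        stack.peek().processString(text);
    }

    void processGroupEnd() {
        flushDelayedGroupStart();
        stack.pop(); // every group start pushed exactly one entry
    }

    private void flushDelayedGroupStart() {
        if (atGroupStart) {
            atGroupStart = false;
            stack.push(stack.peek().processGroupStart());
        }
    }
}
```

Feeding the events for a font table group followed by plain text shows the destination being routed to its own context while the later text reaches the root context again.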
Fixed the copyright of all added files to also be the APL 2.0. Introduced support for parsing annotations and storing them in the document model. The objects contained in a Paragraph are no longer Chunks, but Elements. Chunk inherits from Element, as does Annotation, by inheriting from DocumentPart. That means an annotation can contain any styled text, including text which itself contains annotations.
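A rough sketch of the resulting shape, assuming only what is stated above: an Annotation is both an Element and a DocumentPart. The wrapper class ElementModelSketch, the concrete classes and the annotationDepth() helper are hypothetical and much simpler than the real interfaces.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, reduced model; the real interfaces carry styles and more.
final class ElementModelSketch {
    interface Element {}

    interface DocumentPart {
        List<Paragraph> getParagraphs();
    }

    static final class Chunk implements Element {
        final String text;
        Chunk(String text) { this.text = text; }
    }

    static final class Paragraph {
        final List<Element> elements = new ArrayList<>();
        Paragraph add(Element element) {
            elements.add(element);
            return this;
        }
    }

    // An annotation sits inside a paragraph and holds paragraphs of its own.
    static final class Annotation implements Element, DocumentPart {
        private final List<Paragraph> paragraphs = new ArrayList<>();
        public List<Paragraph> getParagraphs() { return paragraphs; }
    }

    /** Deepest nesting of annotations below a paragraph; 0 if it has none. */
    static int annotationDepth(Paragraph paragraph) {
        int depth = 0;
        for (Element element : paragraph.elements) {
            if (element instanceof Annotation) {
                int inner = 0;
                for (Paragraph nested : ((Annotation) element).getParagraphs()) {
                    inner = Math.max(inner, annotationDepth(nested));
                }
                depth = Math.max(depth, 1 + inner);
            }
        }
        return depth;
    }
}
```

Because a Paragraph holds Elements rather than Chunks, an annotation placed inside another annotation's paragraph needs no special casing.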
Now mentions the existence of the new DocumentBuilder parser and that it is probably the parser to use.
Style is now split up: the base interface Style defines all properties, and a Style can have a name and return its overridden properties. The first derived interface, CharacterStyle, defines setters and getters for just the character-related properties. The second derived interface, ParagraphStyle, also inherits from CharacterStyle and has an additional method for creating a derived CharacterStyle that has the ParagraphStyle as its parent. StyleSheet no longer holds styles directly; rather, it defines getters for instances of CharacterStyleTable and ParagraphStyleTable. The 2007 spec also knows table and section style tables, which are not yet implemented.
Work in DocumentBuilder towards parsing the styles in the style sheet section: currently it only creates the styles, sets their names and puts them into the correct table under their correct ID.
Also new is ignoring unknown groups. These are detected by a group start directly followed by a Command.optionalcommand, i.e. a command not in the Command table. The complete group is ignored by the DocumentBuilder, which fixes some unwanted text ending up in the document. LibreOffice apparently uses a new version of the spec with some section descriptions that we don't understand, which is no longer a problem.
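Ignoring an unknown group comes down to counting nesting depth until the group closes. The sketch below assumes a made-up event shape (UnknownGroupSkipSketch.Event with kinds "{", "}", "cmd" and "text"); its known flag stands in for "the command is in the Command table". It is not the DocumentBuilder code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of skipping a whole group, nested groups included.
final class UnknownGroupSkipSketch {
    static final class Event {
        final String kind;   // "{", "}", "cmd" or "text"
        final String value;  // command name or text, unused for braces
        final boolean known; // for "cmd": whether the command is recognised

        Event(String kind, String value, boolean known) {
            this.kind = kind;
            this.value = value;
            this.known = known;
        }
    }

    /** Returns the text events that survive after unknown groups are dropped. */
    static List<String> visibleText(List<Event> events) {
        List<String> out = new ArrayList<>();
        boolean atGroupStart = false;
        int skipDepth = 0; // greater than 0 while inside an ignored group
        for (Event e : events) {
            if (skipDepth > 0) {
                if (e.kind.equals("{")) {
                    skipDepth++;
                } else if (e.kind.equals("}")) {
                    skipDepth--; // reaches 0 at the brace closing the ignored group
                }
                continue;
            }
            switch (e.kind) {
                case "{":
                    atGroupStart = true;
                    break;
                case "cmd":
                    if (atGroupStart && !e.known) {
                        skipDepth = 1; // unknown command right after a group start
                    }
                    atGroupStart = false;
                    break;
                case "text":
                    out.add(e.value);
                    atGroupStart = false;
                    break;
                default: // "}" outside of any ignored group
                    atGroupStart = false;
            }
        }
        return out;
    }
}
```

An unknown command that appears later inside a group, rather than directly after its start, does not trigger skipping in this sketch.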
My name contains a German umlaut, and I guess it is helpful in general to have a defined encoding.
… to conform with the rest of the project setup and eliminate the warnings.
A Chunk in a Paragraph only needs to know a CharacterStyle. Changed the places in the API where a ParagraphStyle was passed but a CharacterStyle should be used instead.
Allows getting the parent of a Style and making a Style an exact copy of another Style via setTo(). This is supposed to make it easier to use a specific style from the style sheet once parsing of that is implemented.
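One way to picture parent access and setTo() is a style that stores only its overridden properties and falls back to its parent for everything else. This is a guess at the semantics for illustration only: StyleSketch, its string-keyed property map and its get()/set() methods are hypothetical and do not mirror the real Style interface.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical style with parent fallback; property names are free-form strings.
final class StyleSketch {
    private StyleSketch parent;
    private final Map<String, Object> overridden = new HashMap<>();

    StyleSketch(StyleSketch parent) { this.parent = parent; }

    StyleSketch getParent() { return parent; }

    void set(String property, Object value) { overridden.put(property, value); }

    /** Looks up a property here first, then along the parent chain; null if unset. */
    Object get(String property) {
        for (StyleSketch s = this; s != null; s = s.parent) {
            if (s.overridden.containsKey(property)) {
                return s.overridden.get(property);
            }
        }
        return null;
    }

    /** Makes this style an exact copy of the other: same parent, same overrides. */
    void setTo(StyleSketch other) {
        parent = other.parent;
        overridden.clear();
        overridden.putAll(other.overridden);
    }
}
```

In this sketch the copy shares the parent but owns its overrides, so later changes to the source style's own overrides do not leak into the copy.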
Fixing some warnings had introduced a bug: FontContext was not handling \f and called the super class in the default case, which throws an exception on „unexpected“ commands. Also changed the FontTable interface to store and look up fonts by ID rather than by index.
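Looking fonts up by ID matters because the numbers used with \fN in a font table need not be contiguous or start at zero. A minimal sketch of an ID-keyed table follows; FontTableSketch and its methods are hypothetical, and fonts are reduced to their names.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical ID-keyed font table; the real FontTable interface is richer.
final class FontTableSketch {
    // Keeps insertion order, but lookup is by the ID from the RTF font table.
    private final Map<Integer, String> fontsById = new LinkedHashMap<>();

    void addFont(int id, String name) { fontsById.put(id, name); }

    /** Returns the font name registered under this ID, or null if there is none. */
    String fontFor(int id) { return fontsById.get(id); }

    int countOfFonts() { return fontsById.size(); }
}
```

With an index-based list, a table that defines only fonts 0 and 37 would have no sensible answer for index 37; with IDs the lookup is direct.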
Parameter evaluation was reversed. Extended unit test to check parsing of the second paragraph of the test document.
... instead of directing the text to the document destination. Fixes, for example, bookmark titles arriving in the text.
Conflicts: README.md
The most important features of the „document builder“ framework are working.