Fix DocBook XSD, making it deterministic again #4615

Open
wants to merge 5 commits into
base: master
from

Conversation

dirkbaechle
Contributor

Overview of changes

Back when I created the first version of the SCons DocBook XSD, a lot of cruft was added by copy-n-paste. In addition, most SCons tags were allowed to appear 'almost everywhere', leading to it being reported as "not deterministic" by some XML scanners/validators.

This is now corrected, and the new grammar is much stricter about where SCons-specific XML tags may be used.
I also had to patch the doc files in several places accordingly.

Contributor Checklist:

  • I have tested the new XSD by running "docs-validate.py", "docs-update-generated.py" and a full release build locally.
  • I have updated CHANGES.txt (and read the README.rst).
  • I have checked whether the appropriate documentation should be updated, but found this to be n/a.

@bdbaddog
Contributor

@dirkbaechle That moved the error, but still an error:

% python bin/docs-validate.py
0.46% (1/216) SCons/Action.xml
Traceback (most recent call last):
  File "/Users/bdbaddog/devel/scons/git/users/prs/bin/docs-validate.py", line 21, in <module>
    if SConsDoc.validate_all_xml(
       ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/bdbaddog/devel/scons/git/users/prs/bin/SConsDoc.py", line 445, in validate_all_xml
    if not tf.validateXml(fp, xmlschema_context):
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/bdbaddog/devel/scons/git/users/prs/bin/SConsDoc.py", line 340, in validateXml
    TreeFactory.xmlschema = etree.XMLSchema(xmlschema_context)
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "src/lxml/xmlschema.pxi", line 90, in lxml.etree.XMLSchema.__init__
lxml.etree.XMLSchemaParseError: local complex type: The content model is not determinist., line 94
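
For readers who want to see this class of error in isolation, here is a minimal sketch. It assumes lxml is installed; both toy schemas are invented for illustration and are not taken from the SCons or DocBook XSD files. The second schema allows an optional "lot" directly followed by a required "lot", a textbook ambiguous content model. Whether a given libxml2 build actually rejects it may depend on the version, so the script only prints what it finds.

```python
try:
    from lxml import etree
except ImportError:  # lxml is a third-party package
    etree = None

# Toy schema skeleton (hypothetical, not the SCons DocBook XSD).
XSD_TEMPLATE = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="article">
    <xs:complexType>
      <xs:sequence>
        {body}
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>"""

# Every element name appears once, so each input token has one possible match.
DETERMINISTIC = XSD_TEMPLATE.format(
    body='<xs:element name="title" type="xs:string"/>'
         '<xs:element name="lot" type="xs:string" minOccurs="0"/>'
)

# A first <lot> could match either particle: the optional or the required one.
AMBIGUOUS = XSD_TEMPLATE.format(
    body='<xs:element name="lot" type="xs:string" minOccurs="0"/>'
         '<xs:element name="lot" type="xs:string"/>'
)


def schema_error(xsd_text):
    """Return None if the schema compiles, else the parser's error message."""
    if etree is None:
        raise RuntimeError("lxml is not installed")
    try:
        etree.XMLSchema(etree.fromstring(xsd_text))
    except etree.XMLSchemaParseError as exc:
        return str(exc)
    return None


if __name__ == "__main__" and etree is not None:
    print("deterministic:", schema_error(DETERMINISTIC))
    print("ambiguous:    ", schema_error(AMBIGUOUS))
```

Running the same script under two lxml installs is a quick way to compare how the underlying libxml2 versions treat one schema.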

@mwichmann mwichmann added documentation Release Any an all issues with releasing and packaging SCons itself labels Oct 13, 2024
@mwichmann mwichmann linked an issue Oct 13, 2024 that may be closed by this pull request
@mwichmann
Collaborator

> @dirkbaechle That moved the error, but still an error:

I see the same error.

@dirkbaechle
Contributor Author

Okay, so I did some further digging:

  • I tried your suggestion with the "clean environment" via "venv + pip3 install -r requirements-pkg.txt". This works fine on my side (no errors), because pip3 then downloads and installs lxml v4.9.4 (lxml < 5 condition). This version uses libxml2 v2.10.3 under the hood. The next official release, lxml v5.0.0, already uses libxml2 v2.12.3. My guess is that you two were not using an lxml version < 5 when creating the error texts above.
  • Of course it would be nice to have lxml v5 versions working as well. So I downloaded the sources for libxml2, compiled them, and ran "xmllint" in different versions against the SCons DocBook XSD. For v2.10.3 no errors are shown, but with v2.12.3 up to the latest version v2.13.8 the "not deterministic" problem appears again.
  • This error happens in "dbhierx.xsd" while parsing the syntax for an "article", and here comes the funny part: it is contained in the original DocBook "dbhierx.xsd", too!
  • I have worked on a fix for the "article" syntax (actually "bookcomponent.content" to be more precise) and will post a new commit later this evening. Then the validation should work again, but this would also mean that our "SCons DocBook XSD" isn't simply an extension of the original "DocBook v4.5 XSD" anymore because we changed lines...instead of simply adding stuff. My guess is that you two don't care, but I wanted to mention this beforehand. ;)
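
To check which combination a given environment has, something like the following sketch may help. The version boundaries are only the ones reported in this comment (2.10.3 clean; 2.12.3 through 2.13.8 failing); everything outside them is labelled as not covered. The helper names are made up for illustration, while `etree.LXML_VERSION` and `etree.LIBXML_VERSION` are the version tuples lxml exposes.

```python
try:
    from lxml import etree
except ImportError:  # lxml is third-party; classify_libxml2 works without it
    etree = None


def classify_libxml2(version):
    """Map a libxml2 version tuple to what this thread reported for it."""
    if version == (2, 10, 3):
        return "no error reported"
    if (2, 12, 3) <= version <= (2, 13, 8):
        return "error reported"
    # Anything else was not tested in the comment above.
    return "not covered in this thread"


def installed_versions():
    """Return (lxml_version, libxml2_version) tuples, or None without lxml."""
    if etree is None:
        return None
    return etree.LXML_VERSION, etree.LIBXML_VERSION


if __name__ == "__main__":
    found = installed_versions()
    if found is None:
        print("lxml is not installed")
    else:
        lxml_version, libxml2_version = found
        print("lxml:", lxml_version)
        print("libxml2:", libxml2_version, "->", classify_libxml2(libxml2_version))
```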

@mwichmann
Collaborator

> Okay, so I did some further digging:
>
> * I tried your suggestion with the "clean environment" via "venv + pip3 install -r requirements-pkg.txt". This works fine on my side (no errors), because pip3 then downloads and installs lxml v4.9.4 (lxml < 5 condition). This version uses libxml2 v2.10.3 under the hood. The next official release of lxml v5.0.0 is already using libxml2 v2.12.3, my guess is that you two were **not** using a lxml version < 5 when creating the error texts above.

In my case, for the check I let pip upgrade the otherwise-pinned lxml, pip reports: lxml 5.3.0

It would be nice not to have to change the base DocBook files, but... if you're convinced it's not one of the SCons definitions which is somehow affecting that element, then it probably doesn't hurt, either. The DocBook team rewrote everything for the 5.x series and doesn't use these files any longer, so we shouldn't have the "divergent mods" issue common to vendoring. I discarded the idea of moving to DocBook 5.x, since it looks like our additions would have to be considerably reworked; it didn't seem worth the effort.

@dirkbaechle
Contributor Author

dirkbaechle commented Oct 15, 2024

@mwichmann Yes, switching to DocBook 5.x isn't possible with our current tooling around documentation, because they don't support entities anymore (that we use for linking to tools/builders and such) IIRC.

In the meantime I found a rather crude "hack" to get things going again: removing line 99 (ref="lot") from the file doc/xsd/dbhierx.xsd.

This would make the XSD deterministic again, but would prevent us from using the "lot" tag early in an article document. We could still use it at the end of the document though (see https://tdg.docbook.org/tdg/4.5/article ).
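
As a rough illustration of what such an edit amounts to, the sketch below locates and drops the first `ref="lot"` element in a schema using only the standard library. The XSD fragment is a toy stand-in invented for this example (the group name and its contents are assumptions); it does not reproduce the real content model in doc/xsd/dbhierx.xsd.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"
XS_ELEMENT = f"{{{XS}}}element"

# Toy fragment (hypothetical): "lot" may appear both early and late.
TOY_XSD = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:group name="toy.content">
    <xs:sequence>
      <xs:element ref="lot" minOccurs="0"/>
      <xs:element ref="para" maxOccurs="unbounded"/>
      <xs:element ref="lot" minOccurs="0"/>
    </xs:sequence>
  </xs:group>
</xs:schema>"""


def count_refs(root, name):
    """Count xs:element particles that reference the given element name."""
    return sum(1 for el in root.iter(XS_ELEMENT) if el.get("ref") == name)


def drop_first_ref(root, name):
    """Remove the first xs:element with ref=name; return True if one was removed."""
    for parent in root.iter():
        for child in list(parent):
            if child.tag == XS_ELEMENT and child.get("ref") == name:
                parent.remove(child)
                return True
    return False


if __name__ == "__main__":
    root = ET.fromstring(TOY_XSD)
    print("before:", count_refs(root, "lot"))
    drop_first_ref(root, "lot")
    print("after: ", count_refs(root, "lot"))
```

In the toy fragment the trailing "lot" survives, which mirrors the trade-off described above: the element stays usable at the end of the document but not at the start.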

My plan would be:

  • Open a new issue for properly fixing the XSD grammar, if this is possible at all.
  • Make the "hack" by removing the line in dbhierx.xsd, adding a TODO comment pointing to the actual issue.
  • Bump up the version number of the SCons DocBook XSD to 1.1 in all the right places.
  • Be happy, that we can move on for now. ;)
  • Care about the issue (might take a few weeks for proper analysis) and decide whether it's worthwhile to further invest in it, or simply move on (e.g. towards AsciiDoc).

Questions:

  • Objections?
  • Do you want to bump the lxml version number in the requirements-dev.txt in the same go? Or do you plan to do further tests first?
  • We probably would have to also upload the new version of the XSD to the "scons.org" page at "https://www.scons.org/dbxsd/v1.1/" and make sure it stays there. Are we ready to do this?

@bdbaddog
Contributor

I forgot to mention: just run pip install -U lxml after installing from the requirements file to see the error.
No need to download and build anything separately ;)

RE DocBook 5.0: do they replace entities with something else, or is the concept dropped entirely? Would moving to DB 5.0 make our doc process more future-proof while maintaining the bulk of its current abilities?

@mwichmann
Collaborator

Plan sounds reasonable to me, anyway (I'm not "the maintainer" :-) ). I'm guessing it's not really worth the investment to do a lot more fixing. The current state forces us to either pin an aging version of lxml or lose the ability to build the docs the current way: the formal build includes a validation step, and that step fails. That's not a good state to be in. Personally, I'm okay with just getting past that and not doing more (even though another sticking plaster may be needed when something else changes). Don't consider that a "decision"!

@bdbaddog
Contributor

Yes. I think a fix that allows us to update our lxml is a great solution for this immediate problem.

My other questions above are really an attempt to get a handle on what our options are when we move on from our DocBook 4.5 solution, and what we'd lose or gain with such changes.

@dirkbaechle
Contributor Author

Okay, I'll do the simple hack first...however this might take a week or two, since I'm a little swamped at the moment.

Some more food for the "future of SCons docs" discussion:

  • I checked again, and it seems like we could save most of the document entities when switching to DocBook v5. They are still supported in general, it's just that you can't simply use the default XSLT for rewriting your XML documents from DB 4.5 to DB 5 (see https://docbook.org/docs/howto/2008-02-06/howto#convert4to5 ). I would probably write a "fixer" in Python that could do the main work of transforming the documents. Switching to DB5 would then mean that the validation is done against a RelaxNG file. Not a big showstopper, since this is directly supported by lxml (see https://lxml.de/1.3/validation.html ). But you need a RelaxNG grammar including the SCons-specific elements first. This has to be written and somewhat hand-crafted...and probably tested to some extent. ;)
    A small spike or prototype might show that a switch to DB5 is possible after all. But then we're still stuck to a DocBook toolchain and also stuck with having to support validation of the XML sources.
  • I look at this more from the angle of the user. The validation step was introduced so that not every author has to set up a full DocBook toolchain (often a large and clunky process) to check their edits before submitting a PR. Instead, they can rely on the fact that the XML docs are still "valid" after their edits, and can simply push their commits.
  • With a switch to e.g. AsciiDoc it would be possible to replace this "validation step" with a "visual inspection" of the output (or at least some preview, based on rendered AsciiDoc) by each user/author. The support for AsciiDoc previews and pre-renderers is already very good, and getting stronger, in more modern IDEs like Visual Studio Code. I don't see this happening so much in the DocBook realm...
  • In the end, the project has to decide on a direction: either towards maintaining strict validation (->DocBook) in order to somewhat "protect" the current stem of XML sources, or towards more user-friendly lightweight toolchains (->AsciiDoc, maybe even using something Docker-based like https://github.com/docToolchain/docToolchain , so totally outsourcing the documentation build in a way) for the basic editing of sections and paragraphs.
  • I have the feeling that the latter might bring more basic contributions and simple fixes, e.g. grammar/spelling/punctuation mistakes, by arbitrary users.
  • I would like to avoid a transition to DB5 first, and then having to move to AsciiDoc. That's simply doubling up the work. Maybe it's also a good idea to do a simple survey "DB5 vs AsciiDoc" on the Users ML, ... just to see what people think.
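
To give an idea of what RelaxNG validation through lxml could look like, here is a small sketch. It assumes lxml is installed and uses `etree.RelaxNG`; the grammar is a toy written for this example (a "section" with a "title" and any number of "builder" elements) and is not a DocBook or SCons schema.

```python
try:
    from lxml import etree
except ImportError:  # lxml is a third-party package
    etree = None

# Toy grammar (an assumption, not the SCons schema): a section holds a
# title followed by any number of <builder> elements.
TOY_RNG = """<element name="section" xmlns="http://relaxng.org/ns/structure/1.0">
  <element name="title"><text/></element>
  <zeroOrMore>
    <element name="builder"><text/></element>
  </zeroOrMore>
</element>"""


def is_valid(doc_text, rng_text=TOY_RNG):
    """Validate an XML string against a RelaxNG grammar given in XML syntax."""
    if etree is None:
        raise RuntimeError("lxml is not installed")
    validator = etree.RelaxNG(etree.fromstring(rng_text))
    return validator.validate(etree.fromstring(doc_text))


if __name__ == "__main__" and etree is not None:
    good = "<section><title>Builders</title><builder>Program</builder></section>"
    bad = "<section><builder>Program</builder></section>"  # title is missing
    print("good:", is_valid(good))
    print("bad: ", is_valid(bad))
```

A real grammar would extend the DocBook one with the SCons-specific elements; the validation call itself would stay about this small.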

@dirkbaechle
Contributor Author

dirkbaechle commented Oct 18, 2024

Just found this...other projects have similar problems: https://discourse.nixos.org/t/documentation-format/4650

...pointing also to
https://discourse.nixos.org/t/documentation-improvements/3111
NixOS/rfcs#64

@mwichmann
Collaborator

(ugh, put this reply in the wrong place at first)

Yes, roughly been through those same thoughts. I prefer rst because it's a Python project, so we're already writing rst in the docstrings so it's less "switch" between docs-in-code and separate-docs. Someone else was an asciidoc fan and was ready to do the conversion a few years back. And MD is the most ubiquitous, but least expressive. None give us the easy ability to write doc for a feature together (function, cvars, builders, tools) and have them generated into their own sections like our extensions to docbook allow now. I suppose that could be written but that's another big chunk of work for a project that already has far too few contributing resources. Also, ability to generate into manpage (*roff) is still expected by distros who package SCons, asciidoc can do that natively but I'm less sure how easy that is with other forms.

@bdbaddog
Contributor

bdbaddog commented Oct 18, 2024

Realistically, @mwichmann is responsible (historically) for the lion's share of doc edits. For other users making edits, CI can catch issues, so having every developer set up a DocBook toolchain is not strictly necessary.

So I don't think we really need to query the users mailing list on this; we can make an executive decision among @mwichmann @dirkbaechle and @bdbaddog .

Currently we're using entities extensively to provide hyperlinking; does AsciiDoc have something similar?
I guess the big question is: what would we lose if we go to AsciiDoc?
Can we extract the examples, generate their output, and include it in the built docs?
Can we have hyperlinks with something similar to entities?
Can we generate PDF, HTML, text, and manpage output?

@mwichmann
Collaborator

You can create roles in restructured text... like the things that look like :meth:some-method - there could be :cvar:, :builder:, etc. that cause particular interpretation. I don't know how that works in any further detail.
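
For what it's worth, a custom role might look roughly like the sketch below. It assumes docutils is installed; the role name `cvar`, the `$` prefix, and the `#cv-NAME` link target are all made up for illustration and do not describe how the SCons docs would actually resolve links.

```python
try:
    from docutils import nodes
    from docutils.core import publish_parts
    from docutils.parsers.rst import roles
except ImportError:  # docutils is a third-party package
    nodes = None


def cvar_role(name, rawtext, text, lineno, inliner, options=None, content=None):
    """Render :cvar:`CCFLAGS` as a link to a hypothetical #cv-CCFLAGS anchor."""
    node = nodes.reference(rawtext, "$" + text, refuri="#cv-" + text)
    # A role returns the generated nodes plus any system messages (none here).
    return [node], []


def render_html(source):
    """Register the role and return the HTML body for a reST snippet."""
    if nodes is None:
        raise RuntimeError("docutils is not installed")
    roles.register_local_role("cvar", cvar_role)
    return publish_parts(source, writer_name="html")["body"]


if __name__ == "__main__" and nodes is not None:
    print(render_html("Set :cvar:`CCFLAGS` before calling the builder."))
```

Analogous roles (say, `:builder:`) could be registered the same way; generating per-kind sections from one source file, as the DocBook extensions do now, would still need separate tooling.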

adoc is fairly sophisticated in that it can generate DocBook and then use the DocBook chain to generate further output (it doesn't generate only DocBook, but it can). I'm not sure if that would be a good or bad thing for us though :-)

@dirkbaechle
Contributor Author

I'm not sure that we could directly publish EPUB from AsciiDoc, for example. There seems to be an EPUB3 converter ( https://docs.asciidoctor.org/epub3-converter/latest/ ), but I haven't tried it yet.
I don't see a lot that we'd lose when switching to AsciiDoc, I dare to say it's as "mighty" as DocBook for all our needs.

And even if we simply use AsciiDoc as "driver" for our current DocBook toolchain, being able to edit sources in a non-XML format can be seen as a pro. Plus we'd get rid of all the fiddling with XSD/RelaxNG schemata and validation errors, relying on AsciiDoctor to output valid DocBook as soon as its input is valid AsciiDoc...

Creating the SCons output from given examples, and the whole linking stuff around tools/builders/cvars and such, can probably be kept by adapting the SConsDoc module accordingly. This holds true for both reST and AsciiDoc.
To me, AsciiDoc simply feels more "expressive", and since Eclipse adopted AsciiDoc and pushes its usage and development (https://projects.eclipse.org/projects/asciidoc , https://asciidoc-wg.eclipse.org/), it will not simply "die" in the next few years.
Finally, I have to admit that I'm biased because we're now using AsciiDoc (docToolchain, arc42, mainly for software design description and architecture stuff in general) at work...successfully. ;)

@mwichmann
Collaborator

adoc is fine by me, excepting the one thing I mentioned: I'm doing Sphinx-aware reST in docstrings, which are gradually improving, thus the context-switching comment.

@bdbaddog
Contributor

@dirkbaechle - just a ping to see if you had a chance to push your "simple hack" from above?

@dirkbaechle
Contributor Author

Working on it...I'm creating a RelaxNG version of the "SCons DocBook 4.5 extensions" in compact format (RNC). This will be the new "master" for all derived formats like RNG and XSD.
My goal is to switch the internal validation to RelaxNG completely, such that we can still use the full DocBook DTD. I'm not sure if this will also make the XSD problem disappear, so the hack might not even be required anymore...anyway, it won't affect us when simply creating the doc outputs as usual.

Note: This approach should also make it easier to transition to DocBook v5.x, if this should ever come up. But I don't see a switch to DB 5.x happening, as long as the current doc toolchain is running fine again.

@bdbaddog
Contributor

@dirkbaechle - a ping to see if you've had a chance to make any progress?

Labels
documentation Release Any an all issues with releasing and packaging SCons itself
Projects
Status: In progress
Development

Successfully merging this pull request may close these issues.

lxml 5.x breaks doc validation
3 participants