Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use a better parser #1

Open
wants to merge 6 commits into
base: master
Choose a base branch
from
Open

Conversation

JKingweb
Copy link

@JKingweb JKingweb commented Oct 28, 2021

If one wants to have the same results as a browser, one must parse like a browser. This patch replaces the Masterminds parser with an independent implementation (of my and @dustinwilson's design) which is significantly more accurate. It's still limited by the PHP environment (i.e. no scripting, some DOM limitations) and is even slower, but has the following advantages:

  • Performs character encoding detection (including changing the encoding mid-parse)
  • Performs node adoption
  • Performs node fostering
  • Handles namespace-related stuff better (though not perfectly: PHP bugs lurk here)
  • Inserts implicit <tbody> elements
  • Uses less memory

Most of the patch is actually to fix test expectations rather than modify code. The test fixes can be grouped into a few categories:

  • Restoring attributes (e.g. xml:lang) the Masterminds parser drops
  • Removing xmlns:xlink attributes the Masterminds parser creates out of whole cloth
  • Restoring attribute values the Masterminds serializer omits for no apparent reason (e.g. alt now becomes alt="")
  • Dropping the values of boolean attributes on HTML elements (e.g. itemscope="itemscope" now becomes itemscope)
  • Removing the extra space in self-closing foreign start tags (e.g. <svg /> now becomes <svg/>)
  • Adding <tbody> tags for elements the Masterminds parser fails to create

One test ("infobae") has materially different output, though only with respect to whitespace nodes. When reformatted onto more than one line for readability the two outputs are identical, so I'm not sure if the difference matters.

I also cleaned up the libxml-based parsing logic a bit.

One thing to note is that our parser also allows specifying an authoritative encoding (such as from an HTTP header), but this is not currently exposed to the user by Readability. Consequently and for compatibility with the Factorio test UTF-8 is used as a fallback encoding rather than the usually recommended windows-1252.

@JKingweb
Copy link
Author

I noticed belatedly that the tests here are based off of Mozilla's tests with modifications originally to accomodate the differences in Masterminds' parsing and serializing. If there is interest in merging this patch I should probably give it a second pass by importing Mozilla's test unmodified so that we can all be more confident this port matches the JavaScript original's behaviour.

@fivefilters
Copy link
Owner

Thank you so much for this, and appologies for the very late reply. For some reason I didn't get a notification and have just noticed it now. I'll be taking a proper look and testing over the next few days. Will update again here.

@kadnan
Copy link

kadnan commented Sep 12, 2022

@fivefilters any updates?

@Dokukanal
Copy link

@fivefilters What's next here?

@fivefilters
Copy link
Owner

Unfortunately I did not have the time to test this out, and still do not have the time. The main focus for this repo at the moment is dealing with the open issues and then to update with latest changes to Readability.js.

Thinking again about this, I'd say that for many people the improvements listed here, which come at a speed cost, will be useful. But for others, especially those who already rely on the Masterminds parser in their project, or who need the extra speed, sticking with Masterminds will be better. So I'm not comfortable replacing the Masterminds dependency with this, but I think it would be useful to offer it as an option. I haven't thought through how that can be done, maybe as a composer suggestion.

@jensscherbl
Copy link

Maybe also worth keeping an eye on... PHP 8.4 comes with built-in HTML5 parsing capabilities later this year. See https://wiki.php.net/rfc/domdocument_html5_parser

@fivefilters
Copy link
Owner

We've made a start on using the HTML5 parser PHP 8.4 comes with (Lexbor): #32 @jensscherbl

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants