-
Notifications
You must be signed in to change notification settings - Fork 35
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use a better parser #1
base: master
Are you sure you want to change the base?
Conversation
I noticed belatedly that the tests here are based off of Mozilla's tests with modifications originally to accomodate the differences in Masterminds' parsing and serializing. If there is interest in merging this patch I should probably give it a second pass by importing Mozilla's test unmodified so that we can all be more confident this port matches the JavaScript original's behaviour. |
Thank you so much for this, and appologies for the very late reply. For some reason I didn't get a notification and have just noticed it now. I'll be taking a proper look and testing over the next few days. Will update again here. |
@fivefilters any updates? |
@fivefilters What's next here? |
Unfortunately I did not have the time to test this out, and still do not have the time. The main focus for this repo at the moment is dealing with the open issues and then to update with latest changes to Readability.js. Thinking again about this, I'd say that for many people the improvements listed here, which come at a speed cost, will be useful. But for others, especially those who already rely on the Masterminds parser in their project, or who need the extra speed, sticking with Masterminds will be better. So I'm not comfortable replacing the Masterminds dependency with this, but I think it would be useful to offer it as an option. I haven't thought through how that can be done, maybe as a composer suggestion. |
Maybe also worth keeping an eye on... PHP 8.4 comes with built-in HTML5 parsing capabilities later this year. See https://wiki.php.net/rfc/domdocument_html5_parser |
We've made a start on using the HTML5 parser PHP 8.4 comes with (Lexbor): #32 @jensscherbl |
If one wants to have the same results as a browser, one must parse like a browser. This patch replaces the Masterminds parser with an independent implementation (of my and @dustinwilson's design) which is significantly more accurate. It's still limited by the PHP environment (i.e. no scripting, some DOM limitations) and is even slower, but has the following advantages:
<tbody>
elementsMost of the patch is actually to fix test expectations rather than modify code. The test fixes can be grouped into a few categories:
xml:lang
) the Masterminds parser dropsxmlns:xlink
attributes the Masterminds parser creates out of whole clothalt
now becomesalt=""
)itemscope="itemscope"
now becomesitemscope
)<svg />
now becomes<svg/>
)<tbody>
tags for elements the Masterminds parser fails to createOne test ("infobae") has materially different output, though only with respect to whitespace nodes. When reformatted onto more than one line for readability the two outputs are identical, so I'm not sure if the difference matters.
I also cleaned up the libxml-based parsing logic a bit.
One thing to note is that our parser also allows specifying an authoritative encoding (such as from an HTTP header), but this is not currently exposed to the user by Readability. Consequently and for compatibility with the Factorio test UTF-8 is used as a fallback encoding rather than the usually recommended windows-1252.