Complete rewrite without RegExp #72
base: javascript
Conversation
I appreciate the contribution. It's something to consider. FWIW, the reason I've pushed back on various "complete rewrites" in the past is:
None of these are deal breakers, but hopefully that explains why we'll be careful in considering this complete rewrite rather than merging it at first glance.
Certainly. I'd say the biggest risk with a complete rewrite is the behaviour for untested cases; maybe some users are even relying on the current behaviour for malformed JSONs? There is also a lot more going on than just the regexp, which may compromise performance (all those temporary substrings) and may affect readability (my version is clearer to me, but that's likely just because I wrote it, of course). As for the question about removing newlines from strings, I think I can answer it myself: raw newlines are not allowed inside JSON strings, so a newline has to be written as the escape sequence \n, backslash included.
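A quick way to confirm that claim in a console (just an illustration, not part of the PR): JSON.parse rejects a raw newline inside a string literal but accepts the escaped form.

```js
// A literal newline is a raw control character inside the JSON string,
// which the JSON grammar forbids, so JSON.parse throws.
try {
  JSON.parse('{"a": "line1\nline2"}');   // the \n here is a real newline character
} catch (e) {
  console.log('raw newline rejected:', e.message);
}
// The escaped form is valid JSON and decodes to a real newline.
console.log(JSON.parse('{"a": "line1\\nline2"}').a.includes('\n')); // true
```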
RegExp is not that good, or maybe Chrome is just very fast at iterating strings. Here are measurements for the 26 MB string:
time 112 ms = 233405 chars/ms
time 228 ms = 114857 chars/ms
time 349 ms = 74818 chars/ms
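For context, the chars/ms figure is just input length divided by elapsed time; a minimal harness along these lines could reproduce it (the actual benchmark script is not shown in the thread, so `minify` and `source` are placeholders for the implementation under test and the 26 MB input):

```js
// Minimal sketch of the kind of harness behind the figures above.
function bench(minify, source) {
  var t0 = performance.now();
  minify(source);
  var ms = performance.now() - t0;
  console.log('time', Math.round(ms), 'ms =', Math.round(source.length / ms), 'chars/ms');
}
```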
Just out of curiosity, can you run the same benchmark against this alternate regex?
var tokenizer = /["\n\r]|(?:\/\*)|(?:\*\/)|(?:\/\/)/g
That consolidates some of the alternation into a single character class and makes the three groupings non-capturing. I'm just wondering whether that regex is more optimized and lets Chrome's regex engine run faster.
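For reference, the two patterns could also be compared in isolation with a bare match loop; note that the "original" pattern below is my assumption of what the upstream tokenizer looks like, not a quote of the repository, and `source` again stands in for the large test string:

```js
// Assumed upstream tokenizer (capturing groups, separate \n and \r branches)
// versus the consolidated pattern suggested above.
var originalTokenizer = /"|(\/\*)|(\*\/)|(\/\/)|\n|\r/g;
var compactTokenizer  = /["\n\r]|(?:\/\*)|(?:\*\/)|(?:\/\/)/g;

// Time a plain match loop over `source` (placeholder for the 26 MB string).
function timeTokenizer(re, source) {
  re.lastIndex = 0;
  var t0 = performance.now(), matches = 0;
  while (re.exec(source) !== null) matches++;
  var ms = performance.now() - t0;
  console.log(re.source, Math.round(ms), 'ms,', matches, 'matches');
}
```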
time 184 ms = 142460 chars/ms
time 342 ms = 76414 chars/ms
time 5 ms = 5125754 chars/ms
time 45 ms = 579631 chars/ms
time 251 ms = 104024 chars/ms
I am not sure how to make sense of those measurements; they probably depend on the content of the string, which is not ideal.
It was straightforward to port my code to C++ and C, see https://github.com/Virtual-X/JSON.minify/tree/cpp
Somewhat inspired by this gist: https://gist.github.com/WizKid/1170297, I have completely rewritten the minifier to achieve better performance. It passes the given tests and returns the same result as the upstream for all valid JSONs that I have tested, except that it fixes this issue: #71 . Invalid JSONs may produce different results; for example, I have seen the upstream cope with unescaped quotes inside strings, while my code toggles in_comment. Another note: the upstream code removes newlines from inside strings; is that intended? I think it is wrong, but I did it as well to avoid introducing another difference that should be addressed separately. If it needs to be fixed, the test at line 50 can simply be removed.
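For readers who haven't opened the diff: the general technique is a single character-by-character scan that tracks whether the cursor is inside a string or a comment. The sketch below illustrates that idea under my own simplifications; it is not the PR code itself.

```js
// Simplified illustration of a scan-based minifier: one pass over the input,
// tracking string/comment state and copying only the characters that survive.
function minifySketch(json) {
  var out = [];
  var inString = false, inLineComment = false, inBlockComment = false;
  for (var i = 0; i < json.length; i++) {
    var c = json[i];
    if (inLineComment) {
      if (c === '\n' || c === '\r') inLineComment = false;        // line comment ends at a line break
    } else if (inBlockComment) {
      if (c === '*' && json[i + 1] === '/') { inBlockComment = false; i++; }
    } else if (inString) {
      out.push(c);
      if (c === '\\' && i + 1 < json.length) out.push(json[++i]); // keep escape pairs verbatim
      else if (c === '"') inString = false;
    } else if (c === '/' && json[i + 1] === '/') {
      inLineComment = true; i++;
    } else if (c === '/' && json[i + 1] === '*') {
      inBlockComment = true; i++;
    } else if (c === '"') {
      inString = true;
      out.push(c);
    } else if (c !== ' ' && c !== '\t' && c !== '\n' && c !== '\r') {
      out.push(c);                                                // drop whitespace outside strings
    }
  }
  return out.join('');
}
```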
Performance:
wireguard.txt, mentioned here: Javascript: non-regex version about 16x-20x faster on my machine #52
source length 154741, minified length 154741
upstream time 18 ms = 8502 chars/ms
forked time 6 ms = 25367 chars/ms
https://github.com/json-iterator/test-data/blob/master/large-file.json
source length 26141343, minified length 26129992
upstream time 1164 ms = 22454 chars/ms
forked time 562 ms = 46490 chars/ms
https://github.com/cytoscape/cytoscape.js/blob/unstable/documentation/demos/cose-layout/data.json
source length 259660, minified length 186042
upstream time 36 ms = 7213 chars/ms
forked time 10 ms = 27048 chars/ms
22 MB/sec is not bad, but if you can do 45, I'd say why not? The code is also probably easier to port to other languages since it does not need a regex implementation, although in other languages there may be more effective ways to build large strings.