Algorithm is accidentally quadratic #64
This is a great point, thank you for finding it. Just for posterity, would you be willing to post a PR for the python code? I'm going to look at the JS code versions. But I'm not really familiar with the other language ports enough to do those reviews.
I think this problem could exist in the JS version, here: Line 35 in 0827b7f
Although, the regex is rooted to the end with `$`.
I also wonder if the current regex approach could work fine if we limit the "left-hand context" string to a max length, say 250 characters. We could pick a large enough number that we know there wouldn't be a string of consecutive backslashes that long in any real (non-pathological) input.
Now I'm wondering if that regex might work better with…
No problem. I don’t know how your release machinery works so I leave it to you.
For some regexen, backward matching might be possible, but greediness, backtracking, backreferences, UTF-8, and any combinations of the above make the general case complex, and I’m not surprised not all implementations bother with that. Measure.
Yes I could, and no it doesn’t.
Thanks for checking/validating my questions about the RE. It seems like maybe the least disruptive (least risky) fix is what I suggested, simply performing the regex check against a fixed chunk of the final X number of characters of the left-hand context... I think 100 characters is plenty large enough heuristically to ensure no non-pathological inputs would be negatively affected. In JS, I can do: `tmp2 = lc.slice(-100).match(/(\\)*$/);` Could you check to see how that affects the performance, and how it compares to the loop approach you took?
Yes, it is slightly faster than doing the backward loop. However, this would make it fail on inputs that contain a string value containing 50 escaped backslashes followed by an escaped double quote.
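A small Python sketch of that failure case (the surrounding JSON and the 100-character window are illustrative, not from the thread): the full left context ends in an odd number of backslashes, so the quote really is escaped, but a fixed 100-character window sees only an even number and misclassifies it.

```python
import re

# Left context of a quote inside a JSON string containing 50 escaped
# backslashes ("\\" x 50 in the source text) followed by an escaped quote
# ("\""): 101 literal backslash characters precede the next '"'.
lc = '{"key": "' + '\\' * 101

full = re.search(r'\\*$', lc).group(0)           # all 101 trailing backslashes
window = re.search(r'\\*$', lc[-100:]).group(0)  # only the last 100

print(len(full) % 2)    # 1 -> odd  -> the quote IS escaped
print(len(window) % 2)  # 0 -> even -> the windowed check thinks it is not
```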
Right, but if we pick a large enough fixed number (100, 250, 1000, whatever), we know we're not going to hit any real (non-pathological) such inputs. It's a performance-optimization heuristic. What do you think the smallest number is that we're safe there? If we were really worried about that obscure breakage case, we could have a strategy of checking to see if the matched string is the same length as the fixed number (all filled with backslashes) and, if so, re-running the check against a larger chunk.

BTW, I'm preferring the regex approach (with some heuristic) over switching to looping logic, not only because I think the regex engine can loop faster than user code can, but because a regex is more "portable" across languages than looping syntax, so it might be easier to port this fix (and for any future ports).
I am strictly pro-correctness in this kind of issue. (I don’t even fully accept the premise of the tool — in my opinion, comments are invalid in JSON and if you want comments you should be using YAML at source and re-serialize it to JSON at build time.)
I think that leads to a loop of exponentially increasing fixed numbers and harder to follow code.
Only across those that have a regular expression engine built in or readily available, and if you can persuade it to perform.
An elegant solution could be a reversed view on the left context, and forward-matching on that view. However, Python does not provide reversed string views, and doing a reverse string copy on each iteration is still quadratic.
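For illustration only (this is the variant being rejected here, with illustrative names), the reversed-copy version of the check might look like this in Python; the `[::-1]` slice copies the whole left context on every call, which is what keeps the overall cost quadratic:

```python
import re

def quote_is_escaped_reversed(lc):
    # Reverse the left context (an O(n) copy every time!) and forward-match
    # the leading backslashes of the reversed string.
    trailing = re.match(r'\\*', lc[::-1]).group(0)
    return len(trailing) % 2 == 1
```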
Yeah I considered that option too, but I agree that it probably will kill us in making all those string copies to reverse. JS doesn't have a string reverse; you'd basically have to create an array and do a reverse on the array (which is in-place), so that would be a full copy.
I appreciate your difference of opinion here, and that despite doubting the validity of this tool, you're still helpfully engaging to improve it! Yay open source! :) FWIW, I know a lot of people who use this tool not for removing comments but for removing whitespace (to essentially shrink down large JSON) before transmitting over the wire. That's why this tool does both comments and whitespace... it's really more a JSON minifier than a JSON-comment-remover. I happen to think comments in JSON can be useful when JSON is used for config, and I think the benefits of keeping with JSON instead of YAML are worth the need to minify it. But, anyway, no reason to re-litigate that. I accept your opinion as valid even if I disagree.
I appreciate that. However, I think we can come up with a pathologically bad (perf-wise) example that even an optimized "correct" solution will still explode on. For example, with your suggested looping-backward approach, if there was 2.5mb worth of consecutive backslashes, the loop would still have to walk back over all of them.

We can and should optimize here. The question is, which cases do we want to care about, and which ones do we not care about? I don't see a way, without a significantly different algorithmic approach here, to not have some performance corner cases.
In some ways, I agree. I don't really want the code to include that option. But it would be a way to introduce a stepped-performance solution that still achieves "strict correctness".
That's true, but if we have an environment which can't rely on any sort of reasonable regex performance, I doubt they're going to have any other tool option or algorithm that's going to magically outperform this one to a significant degree.

Here's an observation: the regex only needs to operate on the part of the left-context string that goes back to where the previous match was made. In the JS version of this code, that index is accessible via `tokenizer.lastIndex`: Line 32 in 0827b7f
In the python version of the code, IIUC, I think your PR is getting that index via 631e8b7#diff-ba6b740aa2c8ab3ba0b1c45306c8f2c9f0eae389c2adbddd005cc54bc0a4592aR46

Actually, we need to track/keep the PREVIOUS value of that, so we know what chunk of the text has just been skipped over from the previous match to this match. So, what if we just limit the RE to only run against that most recent chunk, rather than the whole string back to the absolute start? In JS:

```js
prevLastIndex = from;
from = tokenizer.lastIndex;
// ..
tmp2 = lc.substring(prevLastIndex).match(/(\\)*$/);
```

In the general case, I think this will limit the RE to running only against a small portion of the overall string. The pathological case here is of course if there's a huge long string where no matches have been made for thousands or millions of characters. So we still have the performance downside to consider. And again, I say that we could cap the distance backward that the substring captures to a max of a few thousand characters and that would be a really good/safe tradeoff. But I'm not fully decided here. Just exploring "out loud". Happy to hear more input.
The problem of minifying valid JSON can already be solved by just parsing and re-serializing, e.g.:

```python
import json
import sys

with open(sys.argv[1]) as infile:
    content = infile.read()
print(json.dumps(json.loads(content), ensure_ascii=False, separators=(',', ':')))
```

or just by running jq as `jq -c .`.
No, it would only do that once for the potential quote delimiter they immediately precede. On the next quote, it will not reach those.
Hey, this is exactly it.
You lose some performance by copying the substring here, but you win by not matching repeatedly over the ever-growing left context. In Python, the match function accepts bounds for the original string, so we don’t even have to copy it.
No, it’s not pathological. You do have to process that chunk, but you only have to process it once. With the last match tracking, every character in the string is processed exactly once, making the whole algorithm linear again.
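A minimal Python sketch of that idea (names are illustrative, not the actual PR): a compiled pattern's `search` accepts `pos`/`endpos` bounds, so the trailing-backslash check can be confined to the chunk between the previous token match and the current quote without copying anything.

```python
import re

TRAILING_BACKSLASHES = re.compile(r'\\*$')

def quote_is_escaped(text, quote_pos, prev_match_end):
    # Scan only text[prev_match_end:quote_pos]; pos/endpos bound the search
    # without building a substring. The pattern always matches (possibly an
    # empty run), and prev_match_end is assumed to sit at a token boundary.
    m = TRAILING_BACKSLASHES.search(text, prev_match_end, quote_pos)
    return len(m.group(0)) % 2 == 1
```

Because each chunk starts where the previous match ended, every character is covered by at most one of these bounded searches, which is the linearity argument above.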
Did you benchmark this "fix" against your JSON file? How does that compare to your original looping backward approach?
Actually, good point, JS has a version of that: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp/lastIndex

We can set the `lastIndex` before running the match:

```js
prevLastIndex = from;
from = tokenizer.lastIndex;
// ..
lc.lastIndex = prevLastIndex;
tmp2 = lc.match(/(\\)*$/);
```
#66 (regexp from last match): 32 seconds per 100 executions

Both are a marked improvement over the original 76 minutes per single execution. I’m okay with either one.
Thanks so much for your contributions here, @yurikhan. Very helpful! This seems like the best solution we've come up with, and hopefully the most portable. Some language ports may be "limited" in their regex features and require, for example, the string copying. But I think all implementations will be able to track the previous match index to limit the left-context length, thereby (as you've pointed out) keeping the whole algorithm much more linear.

I'm going to put in the fix for the JS, and approve your PR for the python. Not sure what to do with the other ports, but hopefully the maintainers of those branches are paying attention here and will pick up and apply the fixes to their own languages. If I don't see those addressed within the next few months, I will try my hand at adjusting those ports (though I won't really have as much language expertise or even ability to test).
@yurikhan FYI: Lines 16 to 19 in a5ea0ba
OK, I posted candidate PRs for the existing implementations. I think they're correct, but I need folks familiar with PHP, Perl, Objective-C, and Java to review these four PRs and validate (run the tests!!!!), complete, or fix any problems I've created:
If anyone listening can help review these PRs, or knows someone who can, please do!
Looking more closely at the performance characteristics, and benchmarking against an actual (large-ish) test file per #52 (comment), it seems that the JS (and thus Node) version is still performance-buggy, so... more work to do here.
Unfortunately, the way JS applies its…

@yurikhan I also tweaked the regexes to use…
I did not really sign up for Python port maintainership, and I consider that my earlier PR satisfactorily fixes the original bug. If you want to tweak regexen to slightly improve the performance further, it might make sense to do that in all ports, and it should probably be a different issue. (I also wonder about the fact that you are matching an arbitrary sequence of backslashes and then separately testing the match length for being even or odd. Would it be better or worse to check specifically for an odd-length sequence like…?)
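The pattern alluded to there was elided above; purely as an illustration (an assumption, not necessarily the pattern that was meant), a regex that matches only an odd-length run of trailing backslashes could look like this in Python:

```python
import re

# Matches only if the text ends in an ODD number of backslashes, i.e. the
# quote that follows would be escaped. The lookbehind stops the match from
# starting in the middle of a run of backslashes.
ODD_TRAILING = re.compile(r'(?<!\\)(?:\\\\)*\\$')

print(bool(ODD_TRAILING.search('x' + '\\' * 1)))  # True
print(bool(ODD_TRAILING.search('x' + '\\' * 2)))  # False
print(bool(ODD_TRAILING.search('x' + '\\' * 3)))  # True
```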
Hello there!
So today I tried to push a 2.5 megabyte file to my team’s Git server and it said “the pre-receive hook timed out after 120 seconds”.
I asked the CI priests (don’t laugh, it’s their official job description) what the 🔥 it could be doing, and they said “the only thing we do on JSON files is this” and supplied a small script involving the Python version of `json_minify`. I ran it on my file. It finished in 76 minutes 12 seconds.
So I looked at the algorithm and found the cause. For each potential string delimiter, it matches `\\*$` against the whole substring from position 0 to the current position. This always succeeds just before the end and matches the longest span of backslashes, but the engine tries every possible starting position from 0 to the current position, unless it is smart enough to notice the anchor at the end and match backwards. (Spoiler warning: Most of them are not.)

As a proof of concept, I replaced that check with a backward loop over the left context,
and it passed all existing tests and processed my 2.5M file in 0.697 seconds. It is definitely a lot less pythonic than the original way, but it sure does the job.
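The exact patch isn't reproduced here, but a backward scan of that sort might look roughly like this in Python (names are illustrative, not the actual diff):

```python
def count_trailing_backslashes(text, end):
    """Count the backslashes immediately preceding index `end`."""
    count = 0
    i = end - 1
    while i >= 0 and text[i] == '\\':
        count += 1
        i -= 1
    return count

def quote_is_escaped(text, quote_pos):
    # The quote is escaped iff an odd number of backslashes precede it.
    return count_trailing_backslashes(text, quote_pos) % 2 == 1
```

This only ever walks back over the run of backslashes directly in front of each quote, which is consistent with the sub-second timing reported above.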
You might want to review all ports where this check is done via the same regexp, and verify which of them exhibit quadratic complexity.
There is nothing peculiar about my JSON file; just generate one that has a moderately large number of double quote characters.
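For anyone wanting to reproduce the slowdown, a quick Python sketch for generating such a file (the size, filename, and shape are illustrative, not the original file):

```python
import json

# Roughly a few MB of JSON made up of many short string values, i.e. lots of
# double quote characters for the minifier to examine, plus whitespace to strip.
data = {"items": ["value %06d" % i for i in range(150_000)]}
with open("big.json", "w") as f:
    json.dump(data, f, indent=2)
```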