Cases of retokenization invalidly assume that the tokenizer is stateless leading to crashes in various edge cases #12674
Comments
I think it must do something like read from start of line until end of line to take into account multi-line strings, but I did not check the code for how it handles them.
Where is that happening? It seems like the only reason
I think I poorly explained things there. So let's say you have:
const a = "some stuff ";
The string
pub fn next(self: *Tokenizer) Token {
if (self.pending_invalid_token) |token| {
self.pending_invalid_token = null;
return token;
}
    ...
However - when
however, the tokenizer will not be in the state it was when it returned that initial
I hope that makes more sense now?
Hmm, I think I understand now. It seems like this would be most easily solved by implementing #12449
@moosichu I found a better solution that should be relatively simple to implement. Instead of setting
@zigazeljko I don't think making tokens larger is necessarily a better solution to this problem; #12449 is definitely the best longer-term solution. However, I think fixing the crash is important, and merging the current workaround is better than nothing. Once that is done, hopefully I can find some time to work on #12449.
Zig Version
0.10.0-dev.5788+f6c9b78c8
Steps to Reproduce
This is mainly a blocker for #12661, and causes a crash there. However, this issue is a design bug that is currently present in the code, and I'm sure there could be other ways to trigger it.
Expected Behavior
So the following function in Ast.zig shows this issue quite well:
It expects that a new tokenizer can be created and effectively treats it as stateless provided the correct token_index. This is not the case.
Actual Behavior
This issue is why #11414 crashes instead of just causing an error.
The tokenizer makes use of a pending_invalid_token field to keep track of a bad token if it is nested inside of another one, e.g. if a bad symbol is found inside of a string.
However, if you restart the tokenizer from an arbitrary character index that is within one of those tokens that can create a pending_invalid_token, then it will be in a different state at that index from when the file was originally parsed.
I think the solution here is to roll the tokenizer back to the start of the current line and retokenize from there until the correct & matching token is found at the desired character index. I'm working on a PR that does that, but it will take me a couple of days to do so, so I thought I would make this issue whilst I had a smidgen of time.
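To make the state mismatch concrete, here is a minimal toy model in Zig. It is a sketch under stated assumptions, not the real std.zig.Tokenizer: ToyTokenizer, isSpace, and retokenizeFromLineStart are hypothetical names, and treating a tab inside a string as the "bad symbol" is an assumption made purely for illustration. The toy also assumes that no token spans a newline, so the start of a line is always a safe place to restart.

```zig
const std = @import("std");

// Toy model only: every name here is hypothetical, not the std.zig API.
const Tag = enum { string, invalid, other, eof };
const Token = struct { tag: Tag, start: usize, end: usize };

fn isSpace(c: u8) bool {
    return c == ' ' or c == '\t' or c == '\n';
}

const ToyTokenizer = struct {
    buffer: []const u8,
    index: usize,
    // Set while scanning a string; returned by the *following* call to next().
    pending_invalid_token: ?Token = null,

    fn init(buffer: []const u8, index: usize) ToyTokenizer {
        return ToyTokenizer{ .buffer = buffer, .index = index };
    }

    fn next(self: *ToyTokenizer) Token {
        if (self.pending_invalid_token) |token| {
            self.pending_invalid_token = null;
            return token;
        }
        // Skip whitespace between tokens.
        while (self.index < self.buffer.len and isSpace(self.buffer[self.index])) {
            self.index += 1;
        }
        const start = self.index;
        if (start >= self.buffer.len) {
            return Token{ .tag = .eof, .start = start, .end = start };
        }
        if (self.buffer[start] == '"') {
            self.index += 1;
            while (self.index < self.buffer.len) {
                const c = self.buffer[self.index];
                if (c == '\n') break; // strings never cross a line in this toy
                self.index += 1;
                if (c == '"') break;
                // Assumption: a tab inside a string is the "bad symbol".
                if (c == '\t' and self.pending_invalid_token == null) {
                    self.pending_invalid_token = Token{
                        .tag = .invalid,
                        .start = self.index - 1,
                        .end = self.index,
                    };
                }
            }
            return Token{ .tag = .string, .start = start, .end = self.index };
        }
        while (self.index < self.buffer.len and
            !isSpace(self.buffer[self.index]) and self.buffer[self.index] != '"')
        {
            self.index += 1;
        }
        return Token{ .tag = .other, .start = start, .end = self.index };
    }
};

// Sketch of the proposed workaround: restart at the beginning of the line and
// retokenize until a token with the expected tag starts at target_start.
fn retokenizeFromLineStart(buffer: []const u8, target_start: usize, expected: Tag) ?Token {
    var line_start = target_start;
    while (line_start > 0 and buffer[line_start - 1] != '\n') line_start -= 1;
    var tokenizer = ToyTokenizer.init(buffer, line_start);
    while (true) {
        const token = tokenizer.next();
        if (token.start == target_start and token.tag == expected) return token;
        if (token.tag == .eof or token.start > target_start) return null;
    }
}

test "restarting inside a string loses the pending invalid token" {
    const source = "x \"a\tb\";\n";
    var original = ToyTokenizer.init(source, 0);
    _ = original.next(); // `x`
    _ = original.next(); // the string, which records the tab as pending
    const invalid = original.next();
    try std.testing.expectEqual(Tag.invalid, invalid.tag);

    // A fresh tokenizer started at the same byte offset sees something else.
    var fresh = ToyTokenizer.init(source, invalid.start);
    try std.testing.expect(fresh.next().tag != .invalid);

    // Rolling back to the line start recovers the original token.
    const recovered = retokenizeFromLineStart(source, invalid.start, .invalid);
    try std.testing.expect(recovered != null);
    try std.testing.expectEqual(invalid.start, recovered.?.start);
}
```

Saved to a file, this can be run with `zig test`. The rollback only works because the toy guarantees that a line start is never in the middle of a token; whether the real tokenizer gives the same guarantee for every token kind would need to be checked against its source.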
Apologies for the weird bug report, as this was one identified from looking at the code and trying to solve another issue, which makes this bug symptomatic only with those additional changes from my PR. I also had to dash off, so had to write it quickly; I will refine it later! Sorry if it doesn't make a lot of sense.