Reduce size of Token #1880

Comments
Love this investigation! I made various attempts to reduce the token size when I finished the lexer.
Reducing the token size will yield significant performance improvements: less memory and better CPU cache usage. Previous attempts are #32 and #151.
We have a very easy performance measurement setup with CodSpeed: just make PRs and see their impact. To get started, let's try to box the bigint and regex variants, then merge in the Atom change. To get aggressive, I also had the idea of making everything SoA (from the DoD talk https://vimeo.com/649009599), but unfortunately I don't have the time to experiment with such big changes. I'll give you write access to the repo if you wish to take Oxc's performance to another level.
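For reference, the SoA (struct-of-arrays) layout from that talk would look roughly like this; purely illustrative, not an existing Oxc type:

```rust
// Purely illustrative: store each token field in its own array instead of
// keeping a Vec<Token> of full structs.
pub struct Tokens {
    kinds: Vec<u8>,   // token kind for each token
    starts: Vec<u32>, // span start for each token
    ends: Vec<u32>,   // span end for each token
}
```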
Thanks for your enthusiasm! After I submitted this issue, I rethought it and concluded it was probably a waste of time, and that reducing the size of `Token` might not have much impact. I'm new to OXC's codebase, but from what I can see, the design means that only the tiny current `Token` is in play at any one time. So I would be interested to know why you think it might have a significant impact. But, like you say, the best way to find out is to do it and look at the benchmarks. I'd like to finish the `Atom` change (#1803) first, though.

Data-oriented design is not something I'd encountered before. Interesting. I'll watch that video when I get time.

PS Thanks for the offer of write access. Really kind of you to have such faith! However, I sadly have little time for coding, so I think you might find my commitment lacking. Also, I'm pretty new to Rust, so I think it's better if others check my work. So, if you don't mind, I'd prefer to continue just tinkering and submitting PRs.
Ah, I see now. Anyway, I will try and make the changes above and see what the benchmarks say.
I'm nerd-sniped and rewatching the DoD video... wish me luck this weekend.
Our token size would be 12 bytes if we removed the token value entirely, which is different from the Zig parser.
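As a rough sketch of what such a value-less 12-byte token could look like (names are illustrative, not the actual oxc definitions):

```rust
// Illustrative only: a token that carries no value at all, just kind, flags and span.
#[derive(Clone, Copy)]
pub enum Kind { Ident, Str, Num /* ... */ } // 1 byte

#[derive(Clone, Copy)]
pub struct Token {
    pub kind: Kind,           // 1 byte
    pub is_on_new_line: bool, // 1 byte
    pub escaped: bool,        // 1 byte (+ 1 byte padding)
    pub start: u32,           // 4 bytes
    pub end: u32,             // 4 bytes
}                             // 12 bytes total
```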
Making all the changes is going to be challenging, so I'll plan them out.
I watched the Andrew Kelley video too. Fascinating stuff. Something else occurred to me on this subject. As I understand it, the flow between parser and lexer is that the parser asks the lexer for the next token, and the lexer hands tokens back one at a time.

My thinking is: at any point in time in the parser, there's only a single token in play. So couldn't the token's value live in a single slot on the lexer, rather than in the `Token` itself? The lexer wouldn't reset that slot until the next token is read, and the result is that `Token` wouldn't need to carry a value at all (see the sketch below).

I guess there are exceptions to this rule, e.g. when the parser receives an ambiguous token and needs to hold on to more than one at once. But I'd imagine that'd be the minority of cases. So maybe it's better to handle storing a backlog of ambiguous tokens in the parser as a special case, and preserve the advantage of "only one token in play at a time" in the general case.

I wonder if that might achieve some of the gains of a full-on DoD approach, without the massive overhaul of the codebase that it sounds like you're contemplating? Does this make any sense? I'm not that familiar with OXC's parser yet, so I could be missing something really obvious.
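A sketch of that idea (not oxc's actual design; type and field names are made up):

```rust
// Since only one token is "in play" at a time, its value could live in a single
// slot on the lexer instead of inside every Token.
pub enum TokenValue<'a> {
    None,
    Number(f64),
    String(&'a str),
}

pub struct Lexer<'a> {
    source: &'a str,
    // Value of the most recently returned token; overwritten when the next token is read.
    current_value: TokenValue<'a>,
}
```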
This is untrue: see oxc/crates/oxc_parser/src/lexer/mod.rs, line 92 (at 24d209c).

Theoretically yes, but parsing JavaScript is ridiculous, e.g. ASI: oxc/crates/oxc_parser/src/cursor.rs, lines 145 to 164 (at 24d209c).

Escaped keywords: oxc/crates/oxc_parser/src/cursor.rs, lines 92 to 101 (at 24d209c).

Fun stuff: oxc/crates/oxc_parser/src/lexer/mod.rs, line 181 (at 24d209c).
I'm heading to bed so just a brief-ish reply on 2 points:
Ah yes, but that's in the lexer. My point was that the lexer only passes a single token back to the parser, and then that's the only token "in play" until the next call into the lexer. Well, at least that's if I understand how the parser works, which quite possibly I don't!
Yes, I hadn't considered that. Removing the drop on `Token` would be a win in itself. But there's currently a memory leak to deal with there. The other solution (which you suggested above) is to defer converting a slice of code to a value until the AST node is created. By the way, I wasn't trying to critique your planned approach. I just wanted to raise another option in case it's useful.
I woke up today, and suddenly the plan you outlined before clicked for me. It's genius! Why optimize a structure when you can remove it entirely? If parsing a string/bigint/number/regex from a chunk of source code can be deferred until the AST node is being created, then the only info required to do that is the token's span. As a side effect, that also removes the extraneous lookups of the token value. Apologies, I completely missed the point before. I do still wonder why it's necessary to copy the strings, though.
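To illustrate the deferred-conversion idea with a toy example (this isn't oxc's API; the function name and its handling of numeric formats are simplified assumptions):

```rust
// A sketch (not oxc's actual API) of deferring value conversion until the AST node
// is built: the token carries only its span, and the parser re-reads the source.
fn parse_numeric_literal(source_text: &str, start: u32, end: u32) -> f64 {
    let raw = &source_text[start as usize..end as usize];
    // Real JS numeric literals need more care (hex, separators, etc.);
    // a plain float parse is enough to show the shape of the idea.
    raw.parse::<f64>().unwrap_or(f64::NAN)
}
```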
This PR is part of #1880. `Token` size is reduced from 48 to 40 bytes. To reconstruct the regex pattern and flags within the parser, the regex string is re-parsed from the end by reading all valid flags. To make things work nicely, the lexer will no longer recover from an invalid regex.
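A sketch of that re-parsing step (illustrative, not the oxc implementation), assuming the token's text still has the form `/pattern/flags`:

```rust
// Scan valid flag characters from the end to find where the flags begin,
// then split off the pattern between the `/` delimiters.
fn split_regex(text: &str) -> (&str, &str) {
    let is_flag = |c: char| matches!(c, 'd' | 'g' | 'i' | 'm' | 's' | 'u' | 'v' | 'y');
    // All flag characters are ASCII, so the char count equals the byte count.
    let flags_len = text.chars().rev().take_while(|&c| is_flag(c)).count();
    let (with_delims, flags) = text.split_at(text.len() - flags_len);
    // Drop the leading and trailing `/` delimiters to get the bare pattern.
    let pattern = &with_delims[1..with_delims.len() - 1];
    (pattern, flags)
}

// e.g. split_regex("/ab+c/gi") == ("ab+c", "gi")
```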
This PR is part of #1880. Token size is reduced from 40 to 32 bytes.
We gain around 5-8% improvement from these 3 PRs. I'm now staring at …
We don't even need a …

The goal is to remove it from …
Part of #1880. `Token` size is reduced from 32 to 16 bytes by changing the previous token value `Option<&'a str>` to a u32 index handle. It would be nice if this handle could be eliminated entirely, because the normal case for a string is always `&source_text[token.span.start..token.span.end]`. Unfortunately, JavaScript allows escaped characters to appear in identifiers, strings and templates. These strings need to be unescaped for equality checks, i.e. `"\a" === "a"`. This leads us to adding an `escaped_strings` vec for storing these unescaped and allocated strings. The performance regression from adding this vec should be minimal because escaped strings are rare. Background reading: https://floooh.github.io/2018/06/17/handles-vs-pointers.html
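A sketch of the handle approach (illustrative names, not the actual oxc types): the token stores a plain `u32` instead of `Option<&str>`, and the rare unescaped strings live in a side table:

```rust
#[derive(Default)]
pub struct EscapedStrings {
    strings: Vec<String>,
}

impl EscapedStrings {
    /// Store an unescaped string and return a small handle for the token to carry.
    pub fn insert(&mut self, unescaped: String) -> u32 {
        self.strings.push(unescaped);
        (self.strings.len() - 1) as u32
    }

    /// Resolve a handle back to the unescaped string.
    pub fn get(&self, handle: u32) -> &str {
        &self.strings[handle as usize]
    }
}
```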
Down to 16 bytes in #1962. I think we've strived for a balance between code complexity and performance. Shall we resolve this issue?
Well, this issue originally proposed reducing `Token` to 40 or 32 bytes. The diff on #1962 is quite large, and I won't have time to go through it properly for a while. So yes, feel free to close this issue, and if I can find an opportunity for further optimization later on, I'll open another one. What was the grand total of the performance boost from all this work in the end? +15%?
CodSpeed shows a reduction from 533 to 480, which is about an 11% speedup.
Down to 12 bytes in https://github.com/oxc-project/oxc/pull/2010/files |
This PR partially fixes oxc-project#1803 and is part of oxc-project#1880. BigInt is removed from the `Token` value, so that the token size can be reduced once we remove all the variants. `Token` is now also `Copy`, which removes all the `clone` and `drop` calls. This yields a 5% performance improvement for the parser.
The `Token` struct used in the parser is currently 48 bytes. This could be reduced to 40 bytes with a couple of reasonably small changes, or to 32 bytes with more invasive changes.

Anatomy of a `Token`

`Token` is currently structured roughly as follows.
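Roughly speaking, the shape is something like this (illustrative field names; assuming the `num-bigint` crate for `BigInt`):

```rust
use num_bigint::BigInt; // 32 bytes on 64-bit targets: Vec + sign + padding

#[derive(Clone, Copy)]
pub enum Kind { Ident, Str, Num, BigInt, RegExp /* ... */ } // 1 byte

#[derive(Clone, Copy)]
pub struct RegExpFlags(u8); // 1 byte of bit flags

pub enum TokenValue<'a> {
    None,
    Number(f64),                                     // 8 bytes
    BigInt(BigInt),                                  // 32 bytes -- the largest variant
    String(&'a str),                                 // 16 bytes
    RegExp { pattern: &'a str, flags: RegExpFlags }, // 24 bytes
}
// TokenValue is 32 bytes: the discriminant hides in BigInt's spare niches (see below).

pub struct Token<'a> {
    pub kind: Kind,            // 1 byte
    pub is_on_new_line: bool,  // flag byte, plus ~5 bytes of padding
    pub start: u32,            // 4 bytes
    pub end: u32,              // 4 bytes
    pub value: TokenValue<'a>, // 32 bytes
}
// Token comes to 48 bytes in total.
```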
Reducing the size

To reduce `TokenValue` to 24 bytes:

1. Box `BigInt`

`BigInt` is the largest variant of the `TokenValue` enum. Replacing `BigInt` with `Box<BigInt>` would reduce that variant from 32 bytes to 8 bytes. This would impose a cost of indirection when reading the value of `TokenValue::BigInt`, but BigInts are fairly rare in JS code, so for most code this cost would be insignificant.

2. Remove the `RegExp` variant

Move `RegExpFlags` into `Token` (using one of the 5 bytes currently used for padding). Then all that remains in `RegExp` is a `&str`, which is identical to the `String` variant, so those 2 variants can be merged. The result is:
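Sketched with the same illustrative names as above:

```rust
use num_bigint::BigInt;

pub enum TokenValue<'a> {
    None,
    Number(f64),         // 8 bytes
    BigInt(Box<BigInt>), // 8 bytes, boxed
    String(&'a str),     // 16 bytes -- now also holds the regex pattern
}
// 24 bytes: 16-byte largest variant + explicit discriminant + padding.
```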
All the variants are now 16 bytes or less. The reason a discriminant is now required where it wasn't before is that the longest variant, `&'a str`, only has 1 niche, which is insufficient to contain the discriminant. In the current layout, `BigInt` has multiple niches, so Rust "smuggles" the discriminant into those niches.

Reducing `TokenValue` to 16 bytes

3. Remove the discriminant

The discriminant is not required, as it duplicates information which is already present in `Token.kind`. E.g. a `Token` with kind `Ident` always contains a string in `value`, whereas kind `Number` is associated with an `f64` in `value`. So `TokenValue` could change from an enum to a union with no discriminant, saving a further 8 bytes. `TokenValue::as_number` etc. would become methods on `Token`, which read the `kind` field and return the requested value only if `kind` matches. The difficulty is in producing a safe interface which doesn't allow setting incompatible `kind` and `value` pairs. It should be doable, but would require a lot of changes to the lexer.

Alternative + further optimization

Once the new `Atom` is implemented (#1803), it will be 16 bytes and contain multiple niches, so if `String(&'a str)` is replaced with `String(Atom<'a>)`, that will also reduce `TokenValue` to 16 bytes without having to convert it to a union.

However, there's also an inefficiency around the use of `TokenValue`. The parser is often doing things like this: `parse_literal_expression` branches on `kind`, and then in `parse_literal_number`, calling `value.as_number()` branches again on `TokenKind`'s discriminant. This is unnecessary - we already know `value` is a number, or we wouldn't be in `parse_literal_number`. I'd imagine correct branch prediction removes most of the cost of the 2nd branch, but this is presumably a hot path. Again, the difficulty is in finding a safe interface to express this.

Questions

Is any of this worthwhile? Would reducing the size of `Token` likely make a measurable impact on performance? If so, how much work is it worth? Take the easy-ish gain of getting it down to 40 bytes, or go the whole hog to reach 32?
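For illustration, a minimal sketch of what the union plus kind-checked accessor from change 3 could look like (not a worked-out design; names are illustrative):

```rust
#[derive(Clone, Copy, PartialEq)]
pub enum Kind { Ident, Str, Number /* ... */ }

// No discriminant of its own: which field is valid is implied by Token.kind.
pub union TokenValue<'a> {
    number: f64,
    string: &'a str,
}

pub struct Token<'a> {
    pub kind: Kind,
    pub start: u32,
    pub end: u32,
    value: TokenValue<'a>,
}

impl<'a> Token<'a> {
    /// Returns the numeric value only when `kind` says this token holds a number.
    pub fn as_number(&self) -> Option<f64> {
        if self.kind == Kind::Number {
            // SAFETY: the lexer only writes the `number` field when kind == Number.
            Some(unsafe { self.value.number })
        } else {
            None
        }
    }
}
```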