-
Notifications
You must be signed in to change notification settings - Fork 646
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Always emit constants in bytecode #3976
base: main
Are you sure you want to change the base?
Conversation
Currently, the bytecode generator adds constants to a const_buf. In doing this, it searches the const_buf for existing consts of the same value. This ends up being quadratic in the number of constants. I thought of adding a hashmap to make the lookup quicker, but I thought we could maybe just emit an instruction to push the constant onto the stack (which is what happens already if the const_buf is full). This will increase the generated bytecode size, but it makes the bytecode generator a lot faster (e.g. I saw get_const_offset taking 25% of the loading time on a 12kb module).
930086d
to
440a16b
Compare
I just wanted to open this PR to start a discussion about the tradeoffs between having this constant list and emitting loads vs. simply pushing the value onto the stack. |
I haven't looked at how much this increases module size, but it definitely decreases bytecode generation time (previously it was like 25% of the time in get_const_offset) |
The data will be helpful since involving constant space is based on a trade-off. @xujuntwt95329 @wenyongh Please join us. |
If we didn't want to pay the price in bytecode size, we could use a hashmap of value->offset to do the searches in ~constant time, at the cost of O(n) memory. |
The bytecode size or memory consumption is really important for us since fast-interp already increases footprint, the main design goal of constant offsets is to reduce the footprint since there may be many duplicated constants. Using hashmap is a choice, another choice may be to just sort the |
Would this approach require a second pass over the module data to collect all the constants? If you've already emitted slot indexes, we now cannot move the const from that slot (e.g. to insert a const before it) without rewriting the bytecode that was previously emitted. Also insertion sort like this is going to have average n^2 performance. Am I missing something? So, I think the choices are either a separate pass to collect all the constants, then sort them and binary search, or a hashmap of value -> index which we maintain alongside the const_buf to speed up searches. |
I mean that the sort/search operations are on the array of
In fact the prepare_bytecode of fast-interp scans the bytecode twice, now it seems that the two passes do the same on the Using hashmap is a good way, my understanding is that the Const struct array is kept and you will create a new hashmap? |
A different process for second traverse is a good idea, but it may waste some space when there are multiple duplicated values. Maybe there can be some special process for some common value (e.g. 0, 1) to reduce the waste. But anyway this is a tradeoff, seems there isn't a best way for all scenarios. |
Agree, there is realy not a best way. Since there are already two traverses in the loader of fast-interp, my suggestion is not to introduce an extra traversing, which also takes time and degrades performance. Instead, had better leverage the first traverse to do something. One way that I can think is to mark down the bytecode positions and calculate the total i64/f64 and i32/f32 const counts in the first traverse, for example: we add two link lists to the loader context, one is for i64/f64 with node like typedef struct I64Node {
uint8 *bytecode;
struct I64Node *next;
} I64Node;
typedef struct I32Node {
uint8 *bytecode;
struct I32Node *next;
} I32Node;
struct WASMLoaderContext {
...
I64Node *i64_node_list;
I32Node *i32_node_list;
uint32 i64_node_count;
uint32 i32_node_count;
...
}; And then add i64/f64.const bytecodes one by one to the first list and add i32/f32.const bytecodes one by one to the second list, and increase Then after the first traverse, we allocate a big buffer (an int64 array) for i64 nodes, traverse i64_node_list and copy i64/f64 const to the array one by one, and qsort the array buffer, and then remove duplicated consts. And same for i32_node_list. After that we can allocate a merged buffer for the two array, copy data to the new buffer and then free all previous buffer (two lists, two array buffer). Then in the second traverse, we can just use binary search (bsearch) to get the const offset when emitting bytecode. |
what about struct WASMLoaderContext {
...
size_t operand_offset; // points to beginning of operand within const_buf.data
Vector *const_buf;
HashMap *const_map; // with appropriate Const equals and hash functions
...
}
bool wasm_loader_get_const_offset(WASMLoaderContext *ctx, uint8 value, void *type, size_t *offset) {
Const search_key;
if (!const_from_value(value, type, &search_key)) return false;
void* entry = bh_hash_map_find(ctx->const_map, &search_key);
if (entry) {
*offset = -*(size_t*)offset;
return true;
}
// It's a value we haven't seen before
ctx->operand_offset += is_32bit_type(type) ? 1 : 2; // whatever the appropriate value should be
bh_vector_append(ctx->const_buf, search_key);
size_t* value = malloc(...);
*value = *offset = ctx->operand_offset; // whatever is necessary to point into ctx->const_buf.data
if (!bh_hash_map_insert(ctx->const_map, pointer_to_operand_offset)) { return false; }
return true;
} Basically augment const_buf with a hashmap to allow ~constant time lookups. We would need to handle the case where const_buf fills up (more than SHORT_MAX entries), this would be treated as it is now (emit param into bytecode). On the first traversal, the buf and map would be filled with unique values. On the second traversal, the value is either already in the map, or the buf filled up and we emit bytecodes. What do you think? |
It is basically OK for me, sounds like the performance is better than my idea but it will consume more memory. A concern is there may be many memory re-alloction when vector length increases. Had better not use @xujuntwt95329 how do you think? |
Thanks for the feedback, and yes in my sample code, it should be My solution will involve many more allocations: once inserted into the hash, the pointer to key cannot change, so that means that they can't go into e.g. a vector that will be reallocated on growth. Perhaps to reduce allocations, we can have a linked list of buffers each containing a given number of keys (e.g. 4kb key buffers). The advantage would be my solution should involve fewer memmoves/deletions, which your solution would presumably incur during the "remove duplicates" phase. Thanks for the feedback, I will try working on a PR to see if this will work. |
I think this is worth trying. And if the memory requirement increased too large, a compilation flag to make this configurable would be great, so that memory constrained devices won't be influenced. |
I added a new pull request here: #4010. Unfortunately it's not yet ready for submission, it is worse for small modules than what we currently have. Do you want to continue the discussion there? |
No description provided.