Size difference between node and browser #82

Open
Marius-Romanus opened this issue Feb 6, 2023 · 6 comments · May be fixed by #83

@Marius-Romanus

Hi, sizeof() reports different sizes for the same string in the browser and in Node.

Since the value is a plain string, with no objects or anything unusual, I would expect the size to be the same in both, right?

Greetings!

console.log("node sizeof()", sizeof('Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed ac vestibulum lacus, sit amet maximus libero. Aliquam erat volutpat. Quisque at orci tortor. Donec at mi nunc.'));
node sizeof() 184

console.log("browser sizeof()", sizeof('Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed ac vestibulum lacus, sit amet maximus libero. Aliquam erat volutpat. Quisque at orci tortor. Donec at mi nunc.'));
browser sizeof() 342

@miktam miktam self-assigned this Feb 7, 2023
@miktam
Owner

miktam commented Feb 7, 2023

OK, the difference comes from here: https://github.com/miktam/sizeof/blob/master/indexv2.js#L88

Node.js uses a precise string calculation; see PR #80.

The browser uses a simplistic approach that assumes every string character takes 2 bytes.

To be precise in the browser environment, let me check whether different VM implementations behave differently.
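
The two-bytes-per-character assumption described above can be sketched as follows. This only illustrates the idea and is not the library's actual code; `naiveStringSize` is a hypothetical name. It would also explain the browser figure in the report: 342 is exactly 2 × 171.

```javascript
// Hypothetical helper (not part of sizeof): estimates the size of a string
// by assuming 2 bytes for every UTF-16 code unit counted by `length`.
function naiveStringSize(str) {
  return str.length * 2;
}

console.log(naiveStringSize('abc'));        // 6
console.log(naiveStringSize('\u{1F600}'));  // 4, since one emoji is 2 code units
```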

@Marius-Romanus
Author

Hello, I've been doing some research, and it seems the best options are:

For node: Buffer.byteLength(string);
For browser: (new TextEncoder().encode(string)).length;

I think this library has a good approach:
https://github.com/ehmicky/string-byte-length

What I don't know is which Node versions you want to support, since that library requires Node >=14.18.0.

Regarding TextEncoder, it seems to have good compatibility: https://caniuse.com/?search=TextEncoder

For the example I gave, both approaches return a size of 171, which does not match what the library returns now.

A complex emoji, 🏳️‍🌈, gives 14.
A simple emoji, 😀, gives 4.

I also don't know if it differs with Cyrillic, Arabic, Chinese characters, etc.
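
To check how other scripts behave, here is a minimal sketch that combines the two options above. `utf8ByteLength` is a hypothetical helper name, not part of sizeof or string-byte-length, and it assumes the runtime provides either a global `Buffer` (Node) or a global `TextEncoder` (browsers and recent Node versions).

```javascript
// Hypothetical helper: UTF-8 byte length of a string in Node or the browser.
function utf8ByteLength(str) {
  // Prefer Buffer.byteLength when running in Node.
  if (typeof Buffer !== 'undefined' && typeof Buffer.byteLength === 'function') {
    return Buffer.byteLength(str, 'utf8');
  }
  // Otherwise encode to UTF-8 and count the resulting bytes.
  return new TextEncoder().encode(str).length;
}

const samples = {
  latin: 'a',
  cyrillic: '\u044F',                      // я
  arabic: '\u0639',                        // ع
  chinese: '\u4E2D',                       // 中
  emoji: '\u{1F600}',                      // 😀
  flag: '\u{1F3F3}\uFE0F\u200D\u{1F308}',  // 🏳️‍🌈 (four code points)
};

for (const [name, text] of Object.entries(samples)) {
  console.log(name, utf8ByteLength(text));
}
// latin 1, cyrillic 2, arabic 2, chinese 3, emoji 4, flag 14
```

In UTF-8 the byte count depends on the code point, so Cyrillic and Arabic letters take 2 bytes each and most Chinese characters take 3.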

Greetings

@miktam
Owner

miktam commented Feb 7, 2023

@Marius-Romanus thank you for the investigation!
The browser-based implementation seems useful; I added it in #83.

Regarding the Node.js version, compatibility might be an issue, as you rightly noted.
The current implementation gives similar results (184 in the current version vs. 171).

@Marius-Romanus
Author

Hello, Buffer.byteLength has existed in Node since the earliest versions, but I think it has been modified many times, and I don't know the expected result in each version or what errors might occur:
https://nodejs.org/docs/latest-v0.10.x/api/buffer.html#buffer_class_method_buffer_bytelength_string_encoding

You have probably already seen it, but here is the current documentation (you can pass the encoding):
https://nodejs.org/dist/latest-v18.x/docs/api/buffer.html#static-method-bufferbytelengthstring-encoding
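
As a small illustration of the encoding argument, assuming a Node environment where `Buffer` is a global (the sample string is arbitrary):

```javascript
// 'héllo' written with an escape so the source file encoding does not matter.
const sample = 'h\u00E9llo';

console.log(Buffer.byteLength(sample));             // 6, the default encoding is 'utf8'
console.log(Buffer.byteLength(sample, 'utf8'));     // 6, 'é' takes 2 bytes
console.log(Buffer.byteLength(sample, 'utf16le'));  // 10, 2 bytes per code unit
console.log(Buffer.byteLength(sample, 'latin1'));   // 5, 1 byte per code unit
```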

@ehmicky may have set that minimum version for some other reason, or even for ECMAScript module imports in Node. ;)

Greetings

@ehmicky

ehmicky commented Feb 7, 2023

Hi everyone,

I am not completely sure I am answering your question correctly, but the reason this module does not support Node 12 is that Node 12 is no longer officially supported. Also, please note that official support for Node 14 will be dropped in 2 months.

The main advantage of using string-byte-length directly instead of inlining Buffer.byteLength(string) and (new TextEncoder().encode(string)).length is that this library switches between 3 different implementations depending on the platform and input size, in order to give the best performance (see benchmarks).

Also, I think you might want to distinguish between UTF-8 and UTF-16 when discussing sizes. A string only has a specific byte size for a given encoding. As pointed out in your README, the JavaScript specification considers strings to be conceptually "somewhat" UTF-16, i.e. each character is 2 bytes. I mentioned "somewhat" because surrogate characters (U+d800 to U+dfff) and astral characters (U+10000 and above) are handled a little differently, and it depends on the JavaScript method being used.

However, in memory, over the network, or in a file, those strings are likely to be encoded in UTF-8, where each character can be 1, 2, 3 or 4 bytes long. string-byte-length gives out the UTF-8 size, not the UTF-16 size, and so do Buffer.from() and new TextEncoder(). IMHO knowing the UTF-8 size is more useful than UTF-16 in most use cases.
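
The distinction can be made concrete with a small comparison. This is only a sketch, assuming `TextEncoder` is available as a global; `utf16Bytes` and `utf8Bytes` are hypothetical helper names. The UTF-16 figure counts 2 bytes per code unit, which is the unit that `String.prototype.length` counts.

```javascript
// Hypothetical helpers comparing the two encodings for the same string.
function utf16Bytes(str) {
  return str.length * 2; // 2 bytes per UTF-16 code unit
}

function utf8Bytes(str) {
  return new TextEncoder().encode(str).length; // bytes after UTF-8 encoding
}

// 'a', 'é', '中', '😀' written as escapes
for (const text of ['a', '\u00E9', '\u4E2D', '\u{1F600}']) {
  console.log(JSON.stringify(text), 'utf8:', utf8Bytes(text), 'utf16:', utf16Bytes(text));
}
// a: 1 vs 2, é: 2 vs 2, 中: 3 vs 2, 😀: 4 vs 4
```

Neither encoding is always smaller: ASCII text is half the size in UTF-8, while a Chinese character such as 中 is larger in UTF-8 than in UTF-16.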

If you're interested in this topic, I wrote the following article, which details the differences.

miktam added a commit that referenced this issue Feb 11, 2023
…string.

Tested on node v12 - works. Does not work on node v10.
Continuation of #82
@miktam
Owner

miktam commented Feb 11, 2023

OK, the latest PR works in Node v12 but does not work in v10.

Let's see if this is the best we can have.
