Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Alphabetic distribution #1587

Merged
merged 19 commits into from
Feb 15, 2025
Merged

Conversation

1Git2Clone
Copy link
Contributor

@1Git2Clone 1Git2Clone commented Feb 14, 2025

  • Added a CHANGELOG.md entry

Summary

Added an Alphabetic distribution similar to Alphanumeric with the same derives and similar:

  • tests;
  • trait implementations;
  • benches;
  • documentation.

Note

There's no documentation in: src/lib.rs, src/rng.rs. Saying this because there's documentation about Alphanumeric there. The reason is that Alphabetic and Alphanumeric are pretty similar already so I didn't see much value in adding more generic examples compared to the type-level documentation.

Motivation

Had the need for an alphabetic random generation and figured it'd be helpful to have it integrated directly into the library for easier usage.

Details

Decided to use rng.random_range for the Distribution<u8> implementation due to it showing slightly better performance benchmarks (69cc117). With three benchmark runs on my system, here are the results with 3 runs:

Rejection sampling

random_bool/standard    time:   [1.1139 ns 1.1224 ns 1.1321 ns]
                        change: [+0.8541% +1.6869% +2.5284%] (p = 0.00 < 0.05)
                        Change within noise threshold. 
Found 114 outliers among 1000 measurements (11.40%)
  30 (3.00%) high mild
  84 (8.40%) high severe
random_bool/standard    time:   [1.1159 ns 1.1248 ns 1.1345 ns]
                        change: [-1.1600% -0.1566% +0.9021%] (p = 0.76 > 0.05)
                        No change in performance detected.
Found 134 outliers among 1000 measurements (13.40%)
  34 (3.40%) high mild
  100 (10.00%) high severe
random_bool/standard    time:   [1.1095 ns 1.1141 ns 1.1191 ns]
                        change: [-1.8864% -0.9835% -0.0256%] (p = 0.04 < 0.05)
                        Change within noise threshold.
Found 67 outliers among 1000 measurements (6.70%)
  24 (2.40%) high mild
  43 (4.30%) high severe

rng.random_range

random_bool/standard    time:   [1.0960 ns 1.1007 ns 1.1067 ns]
                        change: [-1.7071% -1.0058% -0.3238%] (p = 0.01 < 0.05)
                        Change within noise threshold.
Found 80 outliers among 1000 measurements (8.00%)
  39 (3.90%) high mild
  41 (4.10%) high severe
random_bool/standard    time:   [1.1012 ns 1.1069 ns 1.1136 ns]
                        change: [-0.3371% +0.2525% +0.8818%] (p = 0.42 > 0.05)
                        No change in performance detected.
Found 80 outliers among 1000 measurements (8.00%)
  40 (4.00%) high mild
  40 (4.00%) high severe
random_bool/standard    time:   [1.0997 ns 1.1048 ns 1.1104 ns]
                        change: [-0.5032% +0.1257% +0.7238%] (p = 0.70 > 0.05)
                        No change in performance detected.
Found 90 outliers among 1000 measurements (9.00%)
  36 (3.60%) high mild
  54 (5.40%) high severe

Specs

  • Ryzen 5 3600X
  • Arch Linux - 6.13.2-zen1-1-zen
  • i3wm

Note

Did do all the benches with a lot of my desktop apps in the background. That shouldn't be a problem because I didn't open new apps or use my background apps a lot.


Also, I'm a first time contributor so if I missed some implementation details or additional required documentation, feel free to tell me about it.

The rejection sampling equivalent performs slightly worse when I ran the
benchmarks from `benches/`.

Rejection sampling:

```
random_bool/standard    time:   [1.1139 ns 1.1224 ns 1.1321 ns]
                        change: [+0.8541% +1.6869% +2.5284%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 114 outliers among 1000 measurements (11.40%)
  30 (3.00%) high mild
  84 (8.40%) high severe
```

`rng.random_range`:

```
random_bool/standard    time:   [1.1012 ns 1.1069 ns 1.1136 ns]
                        change: [-0.3371% +0.2525% +0.8818%] (p = 0.42 > 0.05)
                        No change in performance detected.
Found 80 outliers among 1000 measurements (8.00%)
  40 (4.00%) high mild
  40 (4.00%) high severe
```
CHANGELOG.md Outdated Show resolved Hide resolved
Copy link
Member

@dhardy dhardy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the motivation for including this in rand — why do we want a distribution over exactly a-zA-Z and not, say, just over A-Z?

src/distr/other.rs Outdated Show resolved Hide resolved
src/distr/other.rs Outdated Show resolved Hide resolved
src/distr/other.rs Outdated Show resolved Hide resolved
src/distr/other.rs Outdated Show resolved Hide resolved
@1Git2Clone
Copy link
Contributor Author

What is the motivation for including this in rand — why do we want a distribution over exactly a-zA-Z and not, say, just over A-Z?

Had the need for a-zA-Z in particular and alphabetic values are more common to use than specifically upper case or lower-case alphabetic values individually.

@dhardy
Copy link
Member

dhardy commented Feb 14, 2025

There's no documentation in: src/lib.rs, src/rng.rs. Saying this because there's documentation about Alphanumeric there. The reason is that Alphabetic and Alphanumeric are pretty similar already so I didn't see much value in adding more generic examples compared to the type-level documentation.

This is fine, but Alphabetic should be mentioned in src/distr/mod.rs.

@dhardy
Copy link
Member

dhardy commented Feb 14, 2025

Had the need for a-zA-Z in particular and alphabetic values are more common to use than specifically upper case or lower-case alphabetic values individually.

Acceptable motivation for a small addition to the library. (It's always hard to tell exactly how many uses some thing has however.)

@1Git2Clone
Copy link
Contributor Author

1Git2Clone commented Feb 14, 2025

This is fine, but Alphabetic should be mentioned in src/distr/mod.rs.

I'll work on that.

Acceptable motivation for a small addition to the library. (It's always hard to tell exactly how many uses some thing has however.)

Understandable. I hope it's useful enough considering core also has things like char::is_alphabetic for related logic.

The benchmarks perform the same but the memory usage should be
lower.
@1Git2Clone
Copy link
Contributor Author

Forgot to remove debugging calls.

This still barely got a faster benchmark despite being one of the
quickest solutions on theory
@1Git2Clone
Copy link
Contributor Author

1Git2Clone commented Feb 14, 2025

Decided to test my code in a project of mine. What I found out was that char::from was considerably slower than extending the String type as a vector (around 32%):

The code in question just generates 1 million markdown lines one by one with a random heading level (1-6) and 128 random alphabetic characters. The performance without char::from was identical to using the Alphanumeric distribution.

I'm not the biggest fan of unsafe Rust, but considering std explicitly states that String types should all be valid UTF-8. I think it's reasonable to use unsafe code there since the additional char::from checks aren't necessary since we already know the exact distribution is guaranteed to be a-zA-Z.

And 32% is a massive deal...

This is the documentation for String::as_mut_vec:

This function is unsafe because the returned &mut Vec allows writing
bytes which are not valid UTF-8. If this constraint is violated, using
the original String after dropping the &mut Vec may violate memory
safety, as the rest of the standard library assumes that Strings are
valid UTF-8.

In our tested distribution we can guarantee that a-zA-Z is valid UTF-8.

@1Git2Clone
Copy link
Contributor Author

1Git2Clone commented Feb 15, 2025

PS:

It's also a bit quicker when you add a Vec::reserve_exact call.

So in the end if we compare:

  1. ~0.78s with the help of String::reserve_exact:
        string.reserve_exact(len);
        string.extend(self.sample_iter(rng).take(len).map(|c| c as char));

to

  1. ~0.54s with reasonable unsafe usage:
        unsafe {
            let v = string.as_mut_vec();
            v.reserve_exact(len);
            v.extend(self.sample_iter(rng).take(len));
        }

@dhardy
Copy link
Member

dhardy commented Feb 15, 2025

Yes, this is likely why I went with the unsafe code originally.

It feels like something the optimizer should be able to solve, but without an ASCII-byte type that may be a hard ask. It turns out there is an unstable addition to the library for this purpose: std::ascii::Char (with an open PR to impl Extend<AsciiChar> for String).

Thus I feel we should use that later, which would imply the sampling type of Alphabetic (and Alphanumeric) should change. I'll create a new issue.

So proceed with your original unsafe code for now (with a brief SAFETY comment).

@1Git2Clone
Copy link
Contributor Author

So proceed with your original unsafe code for now (with a brief SAFETY comment).

Will do.

It turns out there is an unstable addition to the library for this purpose: std::ascii::Char (with an open PR to impl Extend for String).

Thanks for mentioning that, maybe in the future these two parts could be updated to use that, but since it's still unstable it'd be a short while until that happens.

// See [#1590](https://github.com/rust-random/rand/issues/1590).
unsafe {
let v = string.as_mut_vec();
v.reserve_exact(len);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe we don't need to use reserve when using extend

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe we don't need to use reserve when using extend

Thought the same but on my 1 million markdown headings project from earlier project it yields noticeably better performance.

With reserve_exact:

generate_markdown  0.54s user 0.30s system 94% cpu 0.883 total
generate_markdown  0.53s user 0.32s system 92% cpu 0.925 total
generate_markdown  0.53s user 0.21s system 99% cpu 0.747 total

Without reserve_exact:

generate_markdown  0.79s user 0.32s system 95% cpu 1.167 total
generate_markdown  0.79s user 0.31s system 93% cpu 1.181 total
generate_markdown  0.77s user 0.33s system 94% cpu 1.164 total

And its not like we'd need more than we use when the method itself takes len as an argument.

When opening std the definition is this:

    fn extend<I: IntoIterator<Item = char>>(&mut self, iter: I) {
        let iterator = iter.into_iter();
        let (lower_bound, _) = iterator.size_hint();
        self.reserve(lower_bound);
        iterator.for_each(move |c| self.push(c));
    }

Yet using a reserve_exact before it yields much better performance when just generating an exact string like so:

Alphabetic.sample_string(rng, 128)

Honestly, I'm unsure why it's that much more performant, but it's been consistently like that. Feel free to also test it out if you want: taboc/utils/generate_markdown/.

@dhardy dhardy merged commit 8929123 into rust-random:master Feb 15, 2025
15 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants