Add `Alphabetic` distribution #1587

1Git2Clone · 2025-02-14T15:30:52Z

Added a CHANGELOG.md entry

Summary

Added an Alphabetic distribution similar to Alphanumeric with the same derives and similar:

tests;
trait implementations;
benches;
documentation.

Note

There's no documentation in: src/lib.rs, src/rng.rs. Saying this because there's documentation about Alphanumeric there. The reason is that Alphabetic and Alphanumeric are pretty similar already so I didn't see much value in adding more generic examples compared to the type-level documentation.

Motivation

Had the need for an alphabetic random generation and figured it'd be helpful to have it integrated directly into the library for easier usage.

Details

Decided to use rng.random_range for the Distribution<u8> implementation due to it showing slightly better performance benchmarks (69cc117). With three benchmark runs on my system, here are the results with 3 runs:

Rejection sampling

random_bool/standard    time:   [1.1139 ns 1.1224 ns 1.1321 ns]
                        change: [+0.8541% +1.6869% +2.5284%] (p = 0.00 < 0.05)
                        Change within noise threshold. 
Found 114 outliers among 1000 measurements (11.40%)
  30 (3.00%) high mild
  84 (8.40%) high severe
random_bool/standard    time:   [1.1159 ns 1.1248 ns 1.1345 ns]
                        change: [-1.1600% -0.1566% +0.9021%] (p = 0.76 > 0.05)
                        No change in performance detected.
Found 134 outliers among 1000 measurements (13.40%)
  34 (3.40%) high mild
  100 (10.00%) high severe
random_bool/standard    time:   [1.1095 ns 1.1141 ns 1.1191 ns]
                        change: [-1.8864% -0.9835% -0.0256%] (p = 0.04 < 0.05)
                        Change within noise threshold.
Found 67 outliers among 1000 measurements (6.70%)
  24 (2.40%) high mild
  43 (4.30%) high severe

`rng.random_range`

random_bool/standard    time:   [1.0960 ns 1.1007 ns 1.1067 ns]
                        change: [-1.7071% -1.0058% -0.3238%] (p = 0.01 < 0.05)
                        Change within noise threshold.
Found 80 outliers among 1000 measurements (8.00%)
  39 (3.90%) high mild
  41 (4.10%) high severe
random_bool/standard    time:   [1.1012 ns 1.1069 ns 1.1136 ns]
                        change: [-0.3371% +0.2525% +0.8818%] (p = 0.42 > 0.05)
                        No change in performance detected.
Found 80 outliers among 1000 measurements (8.00%)
  40 (4.00%) high mild
  40 (4.00%) high severe
random_bool/standard    time:   [1.0997 ns 1.1048 ns 1.1104 ns]
                        change: [-0.5032% +0.1257% +0.7238%] (p = 0.70 > 0.05)
                        No change in performance detected.
Found 90 outliers among 1000 measurements (9.00%)
  36 (3.60%) high mild
  54 (5.40%) high severe

Specs

Ryzen 5 3600X
Arch Linux - 6.13.2-zen1-1-zen
i3wm

Note

Did do all the benches with a lot of my desktop apps in the background. That shouldn't be a problem because I didn't open new apps or use my background apps a lot.

Also, I'm a first time contributor so if I missed some implementation details or additional required documentation, feel free to tell me about it.

The rejection sampling equivalent performs slightly worse when I ran the benchmarks from `benches/`. Rejection sampling: ``` random_bool/standard time: [1.1139 ns 1.1224 ns 1.1321 ns] change: [+0.8541% +1.6869% +2.5284%] (p = 0.00 < 0.05) Change within noise threshold. Found 114 outliers among 1000 measurements (11.40%) 30 (3.00%) high mild 84 (8.40%) high severe ``` `rng.random_range`: ``` random_bool/standard time: [1.1012 ns 1.1069 ns 1.1136 ns] change: [-0.3371% +0.2525% +0.8818%] (p = 0.42 > 0.05) No change in performance detected. Found 80 outliers among 1000 measurements (8.00%) 40 (4.00%) high mild 40 (4.00%) high severe ```

CHANGELOG.md

src/distr/other.rs

…nner`

dhardy

What is the motivation for including this in rand — why do we want a distribution over exactly a-zA-Z and not, say, just over A-Z?

src/distr/other.rs

1Git2Clone · 2025-02-14T16:30:15Z

What is the motivation for including this in rand — why do we want a distribution over exactly a-zA-Z and not, say, just over A-Z?

Had the need for a-zA-Z in particular and alphabetic values are more common to use than specifically upper case or lower-case alphabetic values individually.

dhardy · 2025-02-14T16:31:23Z

There's no documentation in: src/lib.rs, src/rng.rs. Saying this because there's documentation about Alphanumeric there. The reason is that Alphabetic and Alphanumeric are pretty similar already so I didn't see much value in adding more generic examples compared to the type-level documentation.

This is fine, but Alphabetic should be mentioned in src/distr/mod.rs.

dhardy · 2025-02-14T16:32:51Z

Had the need for a-zA-Z in particular and alphabetic values are more common to use than specifically upper case or lower-case alphabetic values individually.

Acceptable motivation for a small addition to the library. (It's always hard to tell exactly how many uses some thing has however.)

1Git2Clone · 2025-02-14T16:38:12Z

This is fine, but Alphabetic should be mentioned in src/distr/mod.rs.

I'll work on that.

Acceptable motivation for a small addition to the library. (It's always hard to tell exactly how many uses some thing has however.)

Understandable. I hope it's useful enough considering core also has things like char::is_alphabetic for related logic.

The benchmarks perform the same but the memory usage should be lower.

1Git2Clone · 2025-02-14T17:37:26Z

Forgot to remove debugging calls.

This still barely got a faster benchmark despite being one of the quickest solutions on theory

1Git2Clone · 2025-02-14T23:49:28Z

Decided to test my code in a project of mine. What I found out was that char::from was considerably slower than extending the String type as a vector (around 32%):

https://github.com/1Git2Clone/taboc/blob/b08bfe3b2eaa9027cfed0d36a72718f2220166c6/utils/generate_markdown/Cargo.toml#L7-L9

The code in question just generates 1 million markdown lines one by one with a random heading level (1-6) and 128 random alphabetic characters. The performance without char::from was identical to using the Alphanumeric distribution.

I'm not the biggest fan of unsafe Rust, but considering std explicitly states that String types should all be valid UTF-8. I think it's reasonable to use unsafe code there since the additional char::from checks aren't necessary since we already know the exact distribution is guaranteed to be a-zA-Z.

And 32% is a massive deal...

This is the documentation for String::as_mut_vec:

This function is unsafe because the returned &mut Vec allows writing
bytes which are not valid UTF-8. If this constraint is violated, using
the original String after dropping the &mut Vec may violate memory
safety, as the rest of the standard library assumes that Strings are
valid UTF-8.

In our tested distribution we can guarantee that a-zA-Z is valid UTF-8.

1Git2Clone · 2025-02-15T09:33:27Z

PS:

It's also a bit quicker when you add a Vec::reserve_exact call.

So in the end if we compare:

~0.78s with the help of String::reserve_exact:

        string.reserve_exact(len);
        string.extend(self.sample_iter(rng).take(len).map(|c| c as char));

to

~0.54s with reasonable unsafe usage:

        unsafe {
            let v = string.as_mut_vec();
            v.reserve_exact(len);
            v.extend(self.sample_iter(rng).take(len));
        }

dhardy · 2025-02-15T10:17:50Z

Yes, this is likely why I went with the unsafe code originally.

It feels like something the optimizer should be able to solve, but without an ASCII-byte type that may be a hard ask. It turns out there is an unstable addition to the library for this purpose: std::ascii::Char (with an open PR to impl Extend<AsciiChar> for String).

Thus I feel we should use that later, which would imply the sampling type of Alphabetic (and Alphanumeric) should change. I'll create a new issue.

So proceed with your original unsafe code for now (with a brief SAFETY comment).

1Git2Clone · 2025-02-15T10:36:59Z

So proceed with your original unsafe code for now (with a brief SAFETY comment).

Will do.

It turns out there is an unstable addition to the library for this purpose: std::ascii::Char (with an open PR to impl Extend for String).

Thanks for mentioning that, maybe in the future these two parts could be updated to use that, but since it's still unstable it'd be a short while until that happens.

dhardy · 2025-02-15T14:15:12Z

src/distr/other.rs

+        // See [#1590](https://github.com/rust-random/rand/issues/1590).
+        unsafe {
+            let v = string.as_mut_vec();
+            v.reserve_exact(len);


I believe we don't need to use reserve when using extend

I believe we don't need to use reserve when using extend

Thought the same but on my 1 million markdown headings project from earlier project it yields noticeably better performance.

With reserve_exact:

generate_markdown 0.54s user 0.30s system 94% cpu 0.883 total generate_markdown 0.53s user 0.32s system 92% cpu 0.925 total generate_markdown 0.53s user 0.21s system 99% cpu 0.747 total

Without reserve_exact:

generate_markdown 0.79s user 0.32s system 95% cpu 1.167 total generate_markdown 0.79s user 0.31s system 93% cpu 1.181 total generate_markdown 0.77s user 0.33s system 94% cpu 1.164 total

And its not like we'd need more than we use when the method itself takes len as an argument.

When opening std the definition is this:

fn extend<I: IntoIterator<Item = char>>(&mut self, iter: I) { let iterator = iter.into_iter(); let (lower_bound, _) = iterator.size_hint(); self.reserve(lower_bound); iterator.for_each(move |c| self.push(c)); }

Yet using a reserve_exact before it yields much better performance when just generating an exact string like so:

Alphabetic.sample_string(rng, 128)

Honestly, I'm unsure why it's that much more performant, but it's been consistently like that. Feel free to also test it out if you want: taboc/utils/generate_markdown/.

1Git2Clone added 9 commits February 14, 2025 17:03

Add Alphabetic struct.

a6cf8f6

Implement SampleString for Alphabetic

9e8c041

Add Alphabetic tests

5bc7b40

Add Alphabetic documentation

dc0de22

Re-export Alphabetic

0d679b9

Add distribution tests for Alphabetic

1e3aec1

Add benchmarks for Alphabetic

df2fe26

Add Alphabetic distribution. (rust-random#1587)

3d88c67

benjamin-lieser reviewed Feb 14, 2025

View reviewed changes

CHANGELOG.md Outdated Show resolved Hide resolved

Remove Distributions subheading in Unreleased

4148dc7

benjamin-lieser reviewed Feb 14, 2025

View reviewed changes

src/distr/other.rs Show resolved Hide resolved

Map u8 iterator with char::from instead of working with `String.i…

be743d5

…nner`

1Git2Clone force-pushed the alphabetic-distr branch from cddbae0 to be743d5 Compare February 14, 2025 16:17

dhardy reviewed Feb 14, 2025

View reviewed changes

src/distr/other.rs Outdated Show resolved Hide resolved

src/distr/other.rs Outdated Show resolved Hide resolved

src/distr/other.rs Outdated Show resolved Hide resolved

src/distr/other.rs Outdated Show resolved Hide resolved

1Git2Clone added 3 commits February 14, 2025 18:23

Remove unneeded ASCII word

192ddd8

Remove unneeded Alphanumeric reference

e6ac021

Made the Alphabetic docs have 1 code block

b1f9b05

1Git2Clone added 2 commits February 14, 2025 18:42

Add brief module level description for Alphabetic

2f42f42

Mention Alphabetic in the Standard Uniform Distribution docs

f86462e

1Git2Clone force-pushed the alphabetic-distr branch from 353d7b5 to f86462e Compare February 14, 2025 16:55

Changed the range to u8

e5e59ba

The benchmarks perform the same but the memory usage should be lower.

Implement Alphabetic with a branchless jump

baca091

This still barely got a faster benchmark despite being one of the quickest solutions on theory

1Git2Clone force-pushed the alphabetic-distr branch from fc672b9 to baca091 Compare February 14, 2025 17:45

dhardy mentioned this pull request Feb 15, 2025

CHANGE: use std::ascii::Char #1590

Open

Reimplemented String::extend with &mut Vec<u8>

c9093cb

dhardy approved these changes Feb 15, 2025

View reviewed changes

dhardy merged commit 8929123 into rust-random:master Feb 15, 2025
15 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add `Alphabetic` distribution #1587

Add `Alphabetic` distribution #1587

1Git2Clone commented Feb 14, 2025 •

edited

Loading

dhardy left a comment

1Git2Clone commented Feb 14, 2025

dhardy commented Feb 14, 2025

dhardy commented Feb 14, 2025 •

edited

Loading

1Git2Clone commented Feb 14, 2025 •

edited

Loading

1Git2Clone commented Feb 14, 2025

1Git2Clone commented Feb 14, 2025 •

edited

Loading

1Git2Clone commented Feb 15, 2025 •

edited

Loading

dhardy commented Feb 15, 2025

1Git2Clone commented Feb 15, 2025

dhardy Feb 15, 2025

1Git2Clone Feb 15, 2025

Add Alphabetic distribution #1587

Add Alphabetic distribution #1587

Conversation

1Git2Clone commented Feb 14, 2025 • edited Loading

Summary

Motivation

Details

Rejection sampling

rng.random_range

Specs

dhardy left a comment

Choose a reason for hiding this comment

1Git2Clone commented Feb 14, 2025

dhardy commented Feb 14, 2025

dhardy commented Feb 14, 2025 • edited Loading

1Git2Clone commented Feb 14, 2025 • edited Loading

1Git2Clone commented Feb 14, 2025

1Git2Clone commented Feb 14, 2025 • edited Loading

1Git2Clone commented Feb 15, 2025 • edited Loading

dhardy commented Feb 15, 2025

1Git2Clone commented Feb 15, 2025

dhardy Feb 15, 2025

Choose a reason for hiding this comment

1Git2Clone Feb 15, 2025

Choose a reason for hiding this comment

Add `Alphabetic` distribution #1587

Add `Alphabetic` distribution #1587

1Git2Clone commented Feb 14, 2025 •

edited

Loading

`rng.random_range`

dhardy commented Feb 14, 2025 •

edited

Loading

1Git2Clone commented Feb 14, 2025 •

edited

Loading

1Git2Clone commented Feb 14, 2025 •

edited

Loading

1Git2Clone commented Feb 15, 2025 •

edited

Loading