Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DRAFT: Replace w_bits field with a function to compute it #276

Closed

Conversation

brian-pane
Copy link

Posting for discussion because the results I'm getting are counterintuitive.

This change replaces one seldom-used field in the State structure with a function to compute the value. That function is called so infrequently that it doesn't even show up when the blogpost-compress benchmark is run with a sampling profiler (Linux perf record -F max -e instructions). But the result is a regression in deflate performance:

Benchmark 1 (69 runs): ./blogpost-compress-baseline-native 1 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          73.4ms ± 1.73ms    72.0ms … 85.5ms          4 ( 6%)        0%
  peak_rss           26.6MB ±  101KB    26.5MB … 26.7MB          0 ( 0%)        0%
  cpu_cycles          279M  ±  690K      278M  …  282M           1 ( 1%)        0%
  instructions        568M  ±  244       568M  …  568M           0 ( 0%)        0%
  cache_references    266K  ± 5.99K      262K  …  312K           4 ( 6%)        0%
  cache_misses        232K  ± 8.48K      205K  …  239K          11 (16%)        0%
  branch_misses      2.90M  ± 3.06K     2.89M  … 2.91M           0 ( 0%)        0%
Benchmark 2 (67 runs): ./target/release/examples/blogpost-compress 1 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          75.1ms ±  568us    74.3ms … 76.7ms          0 ( 0%)        💩+  2.4% ±  0.6%
  peak_rss           26.6MB ± 92.7KB    26.5MB … 26.7MB          0 ( 0%)          +  0.1% ±  0.1%
  cpu_cycles          289M  ±  728K      287M  …  292M           1 ( 1%)        💩+  3.4% ±  0.1%
  instructions        578M  ±  253       578M  …  578M           0 ( 0%)        💩+  1.8% ±  0.0%
  cache_references    266K  ± 5.38K      262K  …  293K           7 (10%)          +  0.1% ±  0.7%
  cache_misses        235K  ± 5.99K      214K  …  243K           6 ( 9%)          +  1.3% ±  1.1%
  branch_misses      2.88M  ± 2.89K     2.87M  … 2.89M           0 ( 0%)          -  0.8% ±  0.0%

@@ -1298,7 +1296,6 @@ pub(crate) struct State<'a> {
pub(crate) insert: usize,

pub(crate) w_size: usize, /* LZ77 window size (32K by default) */
pub(crate) w_bits: usize, /* log2(w_size) (8..16) */
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my theory for the slowdown is that by removing this field, the following fields also move, and some relevant fields move in/out of a particular cache line. You can test that by just adding a dummy field of the same size (e.g. _unused: [u8; 8] and see if that changes anything.

@@ -1358,6 +1355,10 @@ enum DataType {
impl<'a> State<'a> {
pub const BIT_BUF_SIZE: u8 = BitWriter::BIT_BUF_SIZE;

pub(crate) fn w_bits(&self) -> usize {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it should not be needed, but you could try making this a const fn, and/or add a #[inline(always)] annotation. I don't expect either to do anything (this function is tiny so should always be inlined anyway), but maybe it has some unforseen effect.

@brian-pane
Copy link
Author

Confirmed: adding an 8-byte dummy field in the same place makes the performance regression disappear. What's surprising, though, is that the regression shows up as an increase in instructions, not just cycles.

Making the function const and #[inline(always)] without adding the dummy field doesn't change the results, which is good because it at least narrows down the problem.

Since the performance seems to be especially sensitive to the State field layout, I'll experiment with some more rearrangements of the fields.

@brian-pane brian-pane closed this Jan 5, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants