Reduce allocations for IIR filtering #614

martinholters · 2024-12-16T12:32:15Z

Avoid the necessity to re-allocate one of the coefficient vectors if they have different lengths by letting _filt_iir! handle different lengths.
~~(The second commit is some performance-neutral clean-up. I'm not 100% sure we should do it.)~~

For equal-length coefficient vectors performance is practically the same:

julia> @btime filt!($(zeros(100)), $(rand(2)), $([1; rand(1)]), $(rand(100))); # a and b both length 2
  344.546 ns (1 allocation: 64 bytes) # PR
  348.047 ns (1 allocation: 64 bytes) # master

julia> @btime filt!($(zeros(100)), $(rand(3)), $([1; rand(2)]), $(rand(100))); # a and b both length 3
  319.090 ns (1 allocation: 80 bytes) # PR
  317.876 ns (1 allocation: 80 bytes) # master

julia> @btime filt!($(zeros(100)), $(rand(2)), $([1; rand(2)]), $(rand(100)));
  294.293 ns (1 allocation: 80 bytes) # PR
  330.063 ns (2 allocations: 160 bytes) # master

julia> @btime filt!($(zeros(100)), $(rand(3)), $([1; rand(1)]), $(rand(100)));
  285.819 ns (1 allocation: 80 bytes) # PR
  332.861 ns (2 allocations: 160 bytes) # master

julia> @btime filt!($(zeros(100)), $(rand(100)), $([1; rand(1)]), $(rand(100)));
  1.247 μs (1 allocation: 896 bytes) # PR
  1.495 μs (2 allocations: 1.75 KiB) # master

codecov · 2024-12-16T12:39:16Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 97.97%. Comparing base (3e79857) to head (e3f404b).

Additional details and impacted files

@@           Coverage Diff           @@
##           master     #614   +/-   ##
=======================================
  Coverage   97.96%   97.97%           
=======================================
  Files          19       19           
  Lines        3248     3252    +4     
=======================================
+ Hits         3182     3186    +4     
  Misses         66       66

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

wheeheee · 2024-12-16T14:45:15Z

For some other combinations of lengths there seem to be regressions, although idk how significant.

Benchmarks

julia> @benchmark filt!(out, b, a, x) setup=((b, a) = (rand(rand(2:30)), [1.0; rand(rand(1:29))]); x = rand(100); out = similar(x))
BenchmarkTools.Trial: 10000 samples with 204 evaluations.   # master
 Range (min … max):  375.490 ns …  23.085 μs  ┊ GC (min … max): 0.00% … 94.88%
 Time  (median):     891.667 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   909.695 ns ± 620.143 ns  ┊ GC (mean ± σ):  3.23% ±  5.87%

                      ▄▄█▃▁▁ ▁
  ▂▂▂▂▂▂▃▃▂▃▁▂▃▄▄▄▃▃▄▄████████▄▂▃▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  375 ns           Histogram: frequency by time         1.79 μs <

 Memory estimate: 64 bytes, allocs estimate: 2.

BenchmarkTools.Trial: 10000 samples with 213 evaluations.   # PR
 Range (min … max):  345.540 ns …  35.139 μs  ┊ GC (min … max): 0.00% … 96.92%
 Time  (median):     916.432 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   919.951 ns ± 449.703 ns  ┊ GC (mean ± σ):  1.04% ±  2.85%

                           ▁▄▃▃▅▇▆▇▇█▆▄▅▃▃▂▁▁
  ▂▃▂▂▂▂▂▂▂▃▃▄▄▄▄▄▄▄▄▄▅▅▅▆▇████████████████████▆▆▆▅▅▄▄▃▃▃▃▂▂▂▂▂ ▄
  346 ns           Histogram: frequency by time         1.39 μs <

 Memory estimate: 64 bytes, allocs estimate: 2.

julia> @benchmark filt!(out, b, a, x) setup=((b, a) = (rand(rand(2:30)), [1.0; rand(rand(1:29))]); x = rand(1_000); out = similar(x))
BenchmarkTools.Trial: 10000 samples with 8 evaluations.  # master
 Range (min … max):  3.487 μs … 826.413 μs  ┊ GC (min … max): 0.00% … 98.32%
 Time  (median):     8.488 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   8.397 μs ±   9.362 μs  ┊ GC (mean ± σ):  1.45% ±  1.38%

                              ▄▅█▇▇▅▄▂▂▃▂
  ▂▃▂▃▂▃▂▂▄▂▃▂▃▃▂▄▂▂▄▅▃▃▅▃▅▃▃▄███████████▆▆▆▅▃▃▂▂▂▂▁▁▁▁▁▁▁▁▁▁ ▃
  3.49 μs         Histogram: frequency by time        12.7 μs <

 Memory estimate: 64 bytes, allocs estimate: 2.

BenchmarkTools.Trial: 10000 samples with 8 evaluations.  # PR
 Range (min … max):  3.225 μs … 29.587 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     8.825 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   8.899 μs ±  2.204 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                        ▁▂▄▆█▆█▆▆▆▅▄▂▂▁
  ▂▃▂▃▂▃▂▃▄▄▄▄▃▄▄▅▄▄▄▅▆▇█████████████████▇▇▆▆▅▅▄▄▄▃▄▃▃▃▃▃▃▂▂ ▄
  3.22 μs        Histogram: frequency by time        14.1 μs <

 Memory estimate: 64 bytes, allocs estimate: 2.

julia> @benchmark filt!(out, b, a, x) setup=((b, a) = (rand(rand(2:30)), [1.0; rand(rand(1:29))]); x = rand(10_000); out = similar(x))
BenchmarkTools.Trial: 10000 samples with 1 evaluation.  # master
 Range (min … max):  34.900 μs … 399.500 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     84.300 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   84.039 μs ±  20.555 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                           █ ▄▄▁▁ ▁
  ▄▂▃▁▂▁▃▁▃▂▁▄▃▂▂▂▄▄▂▄▂▄▂▄▅█▆████▇█▄▆▇▄▅▄▄▃▃▃▃▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁ ▃
  34.9 μs         Histogram: frequency by time          138 μs <

 Memory estimate: 64 bytes, allocs estimate: 2.

BenchmarkTools.Trial: 10000 samples with 1 evaluation.  # PR
 Range (min … max):  32.000 μs … 206.700 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     90.100 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   91.892 μs ±  24.194 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                     ▂▂▄▆█▆█▆▆▅▅▃▄▄▃▄▂▂▁▁
  ▃▃▂▂▃▃▃▃▄▄▃▄▄▄▅▄▅▅▆████████████████████▇▇▆▆▅▅▅▄▄▄▄▃▃▃▃▃▂▂▂▂▂ ▄
  32 μs           Histogram: frequency by time          154 μs <

 Memory estimate: 64 bytes, allocs estimate: 2.

src/Filters/filt.jl

src/dspbase.jl

martinholters · 2024-12-17T06:38:40Z

Hm, I've tried with the same benchmarks and the results are somewhat hard to interpret, as it's close to measurement noise, but to me it looks like this:

On Julia 1.10.7, the PR is slightly better than master.
On Julia 1.11.2, master is slightly better than this PR.
But not doing the views thing, the PR is slightly better on Julia 1.11.2, too.

All of that might be dominated by randomness, though. Anyway, I've removed the rewrite to views to have this more focused. Can you re-do the benchmarks and see whether it's now an improvement for you, too?

wheeheee · 2024-12-17T10:21:16Z

Hm, not much difference without the views (only tried 1.11.2 for this comparison). If it matters, I gave them each a warmup run and afterwards they were pretty consistent. It might be machine-dependent, but I found an example that consistently shows a difference for me.

julia> @benchmark filt!(out, b, a, x) setup=((b, a) = (rand(15), [1.0; rand(29)]); x = rand(10_000); out = similar(x))
BenchmarkTools.Trial: 10000 samples with 1 evaluation.  # master
 Range (min … max):   87.700 μs … 280.400 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     102.500 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   106.822 μs ±  14.591 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▂      ▅█
  █▆▂▂▃▄▄███▇▆▄▅▆▅▅▄▄▃▃▂▃▄▄▄▄▃▆▄▅▄▃▃▃▃▂▃▄▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁ ▃
  87.7 μs          Histogram: frequency by time          150 μs <

 Memory estimate: 592 bytes, allocs estimate: 4.

BenchmarkTools.Trial: 10000 samples with 1 evaluation.  # PR
 Range (min … max):  115.000 μs … 436.600 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     140.700 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   146.303 μs ±  16.144 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▂          ▂▂▁▃▃▄█▅▆▄▃▂▁▃▃▂▂▂▃▂▃▂▁▁▁▁▂▂▂▂▂                    ▂
  █▅▆█▇▇▇▇▆███████████████████████████████████▇▇▇▇▇▆▆▆▆▅▅▅▅▅▅▄▆ █
  115 μs        Histogram: log(frequency) by time        200 μs <

 Memory estimate: 288 bytes, allocs estimate: 2.

OTOH it isn't so one-sided, for other coefficient lengths, the PR is better. So I guess there are just some tradeoffs to be made.

julia> @btime filt!(out, b, a, x) setup=((b, a) = (rand(5), [1.0; rand(29)]); x = rand(10_000); out = similar(x));
  87.800 μs (4 allocations: 592 bytes)  # master
  76.800 μs (2 allocations: 288 bytes)  # PR

EDIT: similar results on 1.10.7

martinholters · 2024-12-17T11:29:13Z

So here's my hypothesis to what's going on: The run-time of the loops only scales approximately linear wrt. iteration count, depending on how well the iteration count matches the vectorization width. So if max(length(a), length(b)) (before any padding) is a "good" length, but min(length(a), length(b)) is a "bad" length (and their difference is a "bad" length for the second loop, too), then this PR can lead to inferior performance.
This also seems to be system specific (not unexpected if my suspicion is correct), as for the example you have provided, I only see about 10% run-time increase compared to your 40%.

Overall, my feeling is that we shouldn't over-value these benchmarks. Users who really want to squeeze the last bits of performance out of filt! on their specific system can always just zero-pad their coefficients until they reach an optimum. For your example, I see improved performance replacing b with [rand(15); zeros(2)]. YMMV.

wheeheee

Fair enough. Then I guess this is good to go.

martinholters · 2025-01-09T12:10:37Z

@wheeheee I'm not too excited of the commit you've pushed. It accesses internals of LaurentPolynomial which AFAICT are non-public/undocumented, and the gain is not that much. Are you ok with me undoing that and then merging?

wheeheee · 2025-01-09T12:11:53Z

Yeah, no problem. Just thought it might be helpful, feel free to change.

Let _filt_iir! handle a, b of different length

878281a

wheeheee reviewed Dec 16, 2024

View reviewed changes

src/Filters/filt.jl Outdated Show resolved Hide resolved

wheeheee reviewed Dec 16, 2024

View reviewed changes

src/dspbase.jl Outdated Show resolved Hide resolved

martinholters force-pushed the mh/improve-_filt_iir branch from 6a1f2ef to 878281a Compare December 17, 2024 06:31

wheeheee approved these changes Dec 17, 2024

View reviewed changes

martinholters force-pushed the mh/improve-_filt_iir branch from e3f404b to 878281a Compare January 9, 2025 12:13

martinholters merged commit 73d58fb into master Jan 9, 2025
19 checks passed

martinholters deleted the mh/improve-_filt_iir branch January 9, 2025 13:06

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Reduce allocations for IIR filtering #614

Reduce allocations for IIR filtering #614

martinholters commented Dec 16, 2024 •

edited

Loading

codecov bot commented Dec 16, 2024 •

edited

Loading

wheeheee commented Dec 16, 2024

martinholters commented Dec 17, 2024

wheeheee commented Dec 17, 2024 •

edited

Loading

martinholters commented Dec 17, 2024

wheeheee left a comment

martinholters commented Jan 9, 2025

wheeheee commented Jan 9, 2025

Reduce allocations for IIR filtering #614

Reduce allocations for IIR filtering #614

Conversation

martinholters commented Dec 16, 2024 • edited Loading

codecov bot commented Dec 16, 2024 • edited Loading

Codecov Report

wheeheee commented Dec 16, 2024

martinholters commented Dec 17, 2024

wheeheee commented Dec 17, 2024 • edited Loading

martinholters commented Dec 17, 2024

wheeheee left a comment

Choose a reason for hiding this comment

martinholters commented Jan 9, 2025

wheeheee commented Jan 9, 2025

martinholters commented Dec 16, 2024 •

edited

Loading

codecov bot commented Dec 16, 2024 •

edited

Loading

wheeheee commented Dec 17, 2024 •

edited

Loading