Direct convolution #292

Closed
galenlynch opened this issue Apr 28, 2019 · 5 comments · Fixed by #545

Comments

@galenlynch
Member

It would be great if we had a direct convolution kernel, which would probably be faster for small convolutions.
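
For reference, a minimal 1-D sketch of what such a direct (time-domain) kernel computes (direct_conv1d is an illustrative name, not proposed API):

# Minimal 1-D direct convolution: O(length(u) * length(v)) multiply-adds,
# versus the O(n log n) FFT path. Assumes 1-based indexing for brevity.
function direct_conv1d(u::AbstractVector, v::AbstractVector)
    out = zeros(promote_type(eltype(u), eltype(v)), length(u) + length(v) - 1)
    for i in eachindex(u), j in eachindex(v)
        out[i + j - 1] += u[i] * v[j]
    end
    return out
end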

@ssfrr
Contributor

ssfrr commented Apr 28, 2019

We have a time-domain implementation as part of filt, as well as a heuristic for when to do time-domain or FFT. Seems like we should unify filt and conv, and then combine the heuristics to pick from the 3 methods (time-domain, full FFT, and overlap-add FFT).
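
As a sketch of what that unified dispatch could look like (all names and thresholds below are made up for illustration, not DSP.jl API):

# Hypothetical unified entry point choosing among the three methods.
# conv_direct, conv_fft_overlap_add, and conv_fft_full are placeholder
# names, and the crossover points would need to be measured, not guessed.
function unified_conv(u::AbstractVector, v::AbstractVector)
    nshort, nlong = minmax(length(u), length(v))
    if nshort <= 16                   # tiny kernel: time domain wins
        return conv_direct(u, v)
    elseif nlong > 8 * nshort         # very unequal lengths: overlap-add FFT
        return conv_fft_overlap_add(u, v)
    else                              # comparable lengths: one full-size FFT
        return conv_fft_full(u, v)
    end
end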

@galenlynch
Member Author

galenlynch commented Apr 29, 2019

Yeah, I 100% agree. My main motivation for adding overlap-save to conv was to pave the way to combining filt and conv, at least for FIR filtering. However, the devil is in the details: extrapolating the 1D direct convolution in filt to ND with abstract arrays seems like it would be just as fiddly as overlap-save was.

@martinholters
Member

I'd really love to have a conv fallback that is as generic as possible, usable with any type combination that supports * and +, i.e. letting collect figure out the result element type. I had made a start somewhere, but hadn't really pursued it. Maybe I can give it another shot. But we might still want an implementation for a limited set of types that is potentially faster.

@martinholters
Member

So here is a potential fallback implementation:

julia> function directconv(A::AbstractArray{<:Any,M}, B::AbstractArray{<:Any,N}) where {M,N}
           # pad axes with singleton dimensions so A and B are both treated as max(M,N)-dimensional
           axes_A = ntuple(i -> axes(A, i), Val(max(M,N)))
           axes_B = ntuple(i -> axes(B, i), Val(max(M,N)))
           # indices k of B that pair with a valid index i-k of A for output index i
           krange(i) = CartesianIndices((:).(max.(first.(axes_B), Tuple(i) .- last.(axes_A)),
                                             min.(last.(axes_B), Tuple(i) .- first.(axes_A))))
           return [sum(A[i-k]*B[k] for k in krange(i))
                   for i in CartesianIndices((:).(first.(axes_A).+first.(axes_B), last.(axes_A).+last.(axes_B)))]
       end
directconv (generic function with 1 method)
directconv (generic function with 1 method)

julia> A = rand(3,4); B = rand(5,6,7);

julia> @btime conv($A,$B);
  112.307 μs (178 allocations: 72.63 KiB)

julia> @btime directconv($A,$B); 
  20.673 μs (885 allocations: 52.00 KiB)

From a brief glance, it's type-stable and the results agree with conv. The huge number of allocations is due to the generator inside sum, which needs to capture A and B and hence has to be heap-allocated.

@galenlynch
Member Author

galenlynch commented Apr 30, 2019

I figured out a fast way to do direct convolution that uses SIMD. It doesn't support weird indexing at the moment, or arrays with differing numbers of dimensions, but it's fast.

function _conv_kern_direct!(out, u, v)
    fill!(out, 0)
    u_region = CartesianIndices(u)
    v_region = CartesianIndices(v)
    one_index = oneunit(first(u_region))
    # accumulate u shifted by each index of v; @simd vectorizes the inner loop
    for vindex in v_region
        @simd for uindex in u_region
            @inbounds out[uindex + vindex - one_index] += u[uindex] * v[vindex]
        end
    end
    out
end

function _conv_kern_direct(
    u::AbstractArray{T, N}, v::AbstractArray{S, N}, su, sv) where {T, S, N}
    sout = su .+ sv .- 1   # size of the full convolution
    out = similar(u, promote_type(T, S), sout)
    _conv_kern_direct!(out, u, v)
end

julia> sa = (2,3,4); sb = (5,6,7); a = rand(sa...); b = rand(sb...);

julia> @benchmark conv($b, $a)
BenchmarkTools.Trial: 
  memory estimate:  27.78 KiB
  allocs estimate:  164
  --------------
  minimum time:     86.637 μs (0.00% GC)
  median time:      99.368 μs (0.00% GC)
  mean time:        143.282 μs (7.74% GC)
  maximum time:     67.902 ms (95.95% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark directconv($b, $a)
BenchmarkTools.Trial: 
  memory estimate:  56.52 KiB
  allocs estimate:  963
  --------------
  minimum time:     25.700 μs (0.00% GC)
  median time:      27.732 μs (0.00% GC)
  mean time:        39.263 μs (22.48% GC)
  maximum time:     5.374 ms (99.37% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark _conv_kern_direct($b, $a, $sb, $sa)
BenchmarkTools.Trial: 
  memory estimate:  3.88 KiB
  allocs estimate:  1
  --------------
  minimum time:     5.357 μs (0.00% GC)
  median time:      5.909 μs (0.00% GC)
  mean time:        8.284 μs (18.72% GC)
  maximum time:     11.008 ms (99.89% GC)
  --------------
  samples:          10000
  evals/sample:     6

It's also competitive with conv for larger inputs:

julia> sa = (1000, 1000, 3); sb = (3, 3, 3); a = rand(sa...); b = rand(sb...);

julia> @btime conv($a, $b);
  135.281 ms (165 allocations: 38.32 MiB)

julia> @btime directconv($a, $b);
  498.466 ms (10040044 allocations: 574.50 MiB)

julia> @btime  _conv_kern_direct($a, $b, $sa, $sb);
  146.988 ms (2 allocations: 38.30 MiB)

These timings use the overlap-save version of conv; the same convolution with the FFT-only version of conv would be slower:

julia> using DSP: _conv_kern_fft!, nextfastfft

julia> sout = sa .+ sb .- 1; out = zeros(sout); nffts = nextfastfft(sout);

julia> @btime _conv_kern_fft!($out, $a, $b, $sa, $sb, $sout, $nffts);
  348.628 ms (172 allocations: 232.88 MiB)

Because it's using SIMD, you get a ~2x speedup for single-precision floats, and a ~4x speedup for Int16, etc.
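
For example, checking the single-precision case with the same setup would just be (timings omitted here):

julia> a32 = rand(Float32, sa...); b32 = rand(Float32, sb...);

julia> @benchmark _conv_kern_direct($a32, $b32, $sa, $sb)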

Also with the same test data:

julia> using ImageFiltering

julia> @btime imfilter($a, ($b,));
  171.826 ms (31 allocations: 68.96 MiB)
