LowerTriangularMatrices: N-dimensional and GPU ready #503
Conversation
With this PR it seems that the roadmap is as follows: […]

I don't really see a way to split up the last point into several PRs; that'll just be a monster PR then again, I guess, but that's probably fine.
Yes, this sounds reasonable. For the third point, in theory we could split it into multiple parts, by defining temporary aliases for these structures that allow accessing the N-dimensional arrays like the vectors of vectors. It's also a question whether the third point should already include proper GPU/KernelAbstractions kernels or not.
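For illustration only, a minimal sketch of what such a temporary alias could look like, assuming the N-dimensional `LowerTriangularArray` stores its coefficients as columns of a flat `data` matrix (as later in this PR); the name `get_layer` is hypothetical and not part of the package:

```julia
using SpeedyWeather

# Access the k-th vertical layer of a 3-dimensional LowerTriangularArray roughly like the
# old vector-of-matrices layout, without copying: one layer is one column of `data`.
# Note this returns the packed lower-triangle coefficients of that layer, not a
# LowerTriangularMatrix; rewrapping would need the matrix constructor.
get_layer(L::LowerTriangularArray{T, 3}, k::Integer) where T = view(L.data, :, k)
```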
I think I'd like to go step by step, simply because changing the indexing structure may introduce errors that we could spot more easily if we don't also do GPU kernels with KernelAbstractions in one go. However, one could already come up with a set of allowed operations / a recommended structure for how to write the kernels, so that, while everything is still CPU-based and KernelAbstractions-independent, we have a kernel that is super close to what it's supposed to be. So for example if you have a look at the kernels and say what we shouldn't do for the GPU, then I'd be happy to write the kernels already with that in mind.
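To make the "GPU-style but still CPU-only" idea concrete, here is a hedged sketch (not code from this PR) of how such a kernel could be structured so that it later maps almost one-to-one onto a KernelAbstractions `@kernel`: one flat index over `data`, no allocations, no triangle-dependent branching inside the loop. The function name is hypothetical.

```julia
using SpeedyWeather

function scale_coefficients!(L::LowerTriangularArray, s::Number)
    data = L.data
    @inbounds for I in eachindex(data)   # in a @kernel, the loop body becomes the kernel with I = @index(Global)
        data[I] *= s
    end
    return L
end
```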
The PR is coming along nicely, but the big bit that's missing is the broadcasting on GPU. I've had a quick look into it, and I am not quite certain how to proceed there right now. To recap: our LowerTriangularArray only stores the lower triangle in its flat `data` array. For the broadcast that's a problem, as it wants to access the upper triangle that doesn't exist, and therefore quickly errors. What we want is that any broadcasting is just directly applied to all elements of `data`.
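A minimal sketch of that desired behaviour (editorial addition; the `LowerTriangularArray(data, m, n)` constructor and the helper name are assumptions, not the PR's API): forward the broadcast to the flat `data` array and rewrap the result, so the non-existent upper triangle is never touched.

```julia
using SpeedyWeather

# hypothetical helper, e.g. broadcast_data(*, L, 2) instead of L .* 2
broadcast_data(f, L::LowerTriangularArray, args...) =
    LowerTriangularArray(f.(L.data, args...), L.m, L.n)
```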
@vchuravy do you have some insights on how to define array broadcasting for the GPU? ☝🏼
I still get this with a 5×5×5 array:

```julia
julia> L[6,5,1] # should throw
ERROR: BoundsError: attempt to access 5×5×5 LowerTriangularArray{Float64, 3, Matrix{Float64}} at index [6, 5]

julia> L[5,5,6] # should also throw!
3.6338242925240716e174
```

I don't know why the tests passed for you? EDIT: This is resolved below.
I'm replacing your

```julia
@inline function Base.getindex(L::LowerTriangularArray{T,N}, I::Vararg{Integer,N}) where {T,N}
    i, j = I[1:2]
    @boundscheck (0 < i <= L.m && 0 < j <= L.n) || throw(BoundsError(L, (i, j)))
    # to get a zero element in the correct shape, we just take the zero element of some valid element;
    # there are probably faster ways to do this, but I don't know how (especially when ":" is used),
    # and this is just a fallback anyway
    j > i && return zero(getindex(L.data, 1, I[3:end]...))
    k = ij2k(i, j, L.m)
    @inbounds r = getindex(L.data, k, I[3:end]...)
    return r
end
```

with

```julia
@inline function Base.getindex(L::LowerTriangularArray{T,N}, I::Vararg{Integer,N}) where {T,N}
    i, j = I[1:2]
    @boundscheck (0 < i <= L.m && 0 < j <= L.n) || throw(BoundsError(L, (i, j)))
    @boundscheck j > i && return zero(T)
    k = ij2k(i, j, L.m)
    return getindex(L.data, k, I[3:end]...)
end
```
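A note on the `@boundscheck` in the replacement (editorial addition): both checks, including the `j > i` early return of `zero(T)`, only run for sure when bounds checking is active; an `@inbounds` caller (after inlining) or `--check-bounds=no` may elide them. A minimal sketch of the checked behaviour, assuming a `zeros(LowerTriangularMatrix{T}, m, n)` constructor analogous to the `ones`/`rand` calls used later in this thread:

```julia
using SpeedyWeather

L = zeros(LowerTriangularMatrix{Float64}, 3, 3)
L[2, 1]   # stored lower-triangle element, here 0.0
L[1, 2]   # upper triangle: the @boundscheck branch returns zero(T)
L[4, 1]   # throws BoundsError while bounds checking is active
```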
Yes, you are right! Thanks!
Okay I think this is almost ready to go!! I still want to check that […]

No they weren't, now all is corrected I believe. It's a bit weird, I needed some […]
I also just learned that GitHub Actions always checks bounds regardless.

Interesting!
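For context (editorial note): `Pkg.test` runs the test suite with `--check-bounds=yes` by default, which is presumably why the `@boundscheck` branches always fire on GitHub Actions. Locally the same behaviour can be forced explicitly, e.g.:

```julia
using Pkg
Pkg.test("SpeedyWeather"; julia_args=["--check-bounds=yes"])
```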
I don't think this is our intention (with `JLArray` as the GPU stand-in):

```julia
julia> JL = adapt(JLArray, L)
5×5 LowerTriangularArray{Float64, 2, JLArray{Float64, 1}}:
 0.533073  0.0        0.0       0.0       0.0
 0.879798  0.921815   0.0       0.0       0.0
 0.313754  0.224968   0.229314  0.0       0.0
 0.811464  0.603529   0.325419  0.481392  0.0
 0.64585   0.0579084  0.509436  0.163703  0.540856

julia> JL + JL # looks good
5×5 LowerTriangularArray{Float64, 2, JLArray{Float64, 1}}:
 1.06615   0.0       0.0       0.0       0.0
 1.7596    1.84363   0.0       0.0       0.0
 0.627507  0.449935  0.458628  0.0       0.0
 1.62293   1.20706   0.650837  0.962785  0.0
 1.2917    0.115817  1.01887   0.327406  1.08171

julia> JL .+ JL # escapes and also where does the upper triangle come from?
5×5 JLArray{Float64, 2}:
 1.06615   1.2917    1.20706   0.458628  0.650837
 1.7596    1.84363   0.115817  0.650837  1.01887
 0.627507  0.449935  0.458628  1.01887   0.962785
 1.62293   1.20706   0.650837  0.962785  0.327406
 1.2917    0.115817  1.01887   0.327406  1.08171
```

I think the upper triangle is from evaluating […]. The corresponding operation for ring grids is fine though!
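For reference, a hedged sketch of the generic mechanism Julia offers to keep a broadcast from escaping a wrapper type (the custom broadcast-style pattern from the Base documentation, not the PR's actual implementation, which defines its own `LowerTriangularStyle` below; the `zeros(LowerTriangularMatrix{T}, m, n)` constructor is assumed to exist analogous to `ones`/`rand`):

```julia
using SpeedyWeather
import Base.Broadcast: Broadcasted, ArrayStyle

# opting the type into its own style means JL .+ JL is no longer planned as a plain array broadcast ...
Base.BroadcastStyle(::Type{<:LowerTriangularMatrix}) = ArrayStyle{LowerTriangularMatrix}()

# ... and `similar` then picks the destination: another LowerTriangularMatrix instead of a dense Array
function Base.similar(bc::Broadcasted{ArrayStyle{LowerTriangularMatrix}}, ::Type{T}) where T
    m, n = length.(axes(bc))
    return zeros(LowerTriangularMatrix{T}, m, n)
end
```

Note that the default materialization still iterates the full m×n range, which is presumably part of why the PR also customizes how the broadcast is executed over `data`, not only the destination type.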
I've removed the directly defined arithmetic for […]. Similar to #520 I've defined two broadcasting styles, one for CPU […]:

```julia
julia> L .== L
ERROR: BoundsError: attempt to access 5×5 LowerTriangularMatrix{Bool} at index [1, 2]
```

because this currently returns a […]:

```julia
julia> JL .== JL
5×5 LowerTriangularArray{Bool, 2, JLArray{Bool, 1}}:
 1  0  0  0  0
 1  1  0  0  0
 1  1  1  0  0
 1  1  1  1  0
 1  1  1  1  1
```

which is maybe not what one would expect, but not too bad in the worst case. I've just changed all tests to use […].
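A hedged workaround for element-wise comparisons on the CPU (editorial suggestion, not necessarily what the tests were changed to): compare the packed coefficients directly, which never touches the upper triangle. Assuming `L` and `L2` are two `LowerTriangularMatrix`s of the same size:

```julia
all(L.data .== L2.data)   # true iff all stored lower-triangle coefficients agree
```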
Next issue:

```julia
julia> L
5×5 LowerTriangularMatrix{Float64}:
 0.296704  0.0       0.0       0.0       0.0
 0.398559  0.926045  0.0       0.0       0.0
 0.958507  0.130742  0.300892  0.0       0.0
 0.867598  0.842258  0.588318  0.977474  0.0
 0.869255  0.989118  0.742295  0.500998  0.221757

julia> L2
5×5 LowerTriangularMatrix{ComplexF64}:
 0.750678+0.558921im        0.0+0.0im       …       0.0+0.0im          0.0+0.0im
 0.825657+0.14191im   0.0197592+0.542655im          0.0+0.0im          0.0+0.0im
 0.866794+0.687417im    0.55294+0.572049im          0.0+0.0im          0.0+0.0im
 0.219007+0.697757im   0.127504+0.590993im     0.454295+0.691951im     0.0+0.0im
 0.135136+0.433922im   0.668148+0.740522im     0.259046+0.285174im  0.21516+0.0390588im

julia> L2 .= L
ERROR: conflicting broadcast rules defined
  Broadcast.BroadcastStyle(::Main.SpeedyWeather.LowerTriangularMatrices.LowerTriangularStyle{2, Vector{ComplexF64}}, ::Main.SpeedyWeather.LowerTriangularMatrices.LowerTriangularStyle{2, Vector{Float64}}) = Main.SpeedyWeather.LowerTriangularMatrices.LowerTriangularStyle{2, Vector{ComplexF64}}()
  Broadcast.BroadcastStyle(::Main.SpeedyWeather.LowerTriangularMatrices.LowerTriangularStyle{2, Vector{Float64}}, ::Main.SpeedyWeather.LowerTriangularMatrices.LowerTriangularStyle{2, Vector{ComplexF64}}) = Main.SpeedyWeather.LowerTriangularMatrices.LowerTriangularStyle{2, Vector{Float64}}()
One of these should be undefined (and thus return Broadcast.Unknown).
```

Can probably be resolved by making sure the type […].
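The error says that `Broadcast.BroadcastStyle` is defined for both argument orders with different winners. A hedged sketch of one possible way out (not necessarily the fix used in the next comment): keep the element type out of the style, so a Float64 and a ComplexF64 operand already map to the same style and no binary rule is needed. `Base.typename(A).wrapper` (e.g. `Array` for `Vector{Float64}`) is used here purely for illustration:

```julia
using SpeedyWeather
import Base.Broadcast: AbstractArrayStyle

struct SketchStyle{N, ArrayType} <: AbstractArrayStyle{N} end
SketchStyle{N, A}(::Val{M}) where {N, A, M} = SketchStyle{M, A}()  # convention for mixing dimensions

# Vector{Float64} and Vector{ComplexF64} both collapse to Array, so operands of
# different element types share one style and the two rules can never conflict
Base.BroadcastStyle(::Type{LowerTriangularArray{T, N, A}}) where {T, N, A} =
    SketchStyle{N, Base.typename(A).wrapper}()
```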
Resolved now by making the […]:

```julia
julia> L = ones(LowerTriangularMatrix{Float32}, 3, 3)
3×3 LowerTriangularMatrix{Float32}:
 1.0  0.0  0.0
 1.0  1.0  0.0
 1.0  1.0  1.0

julia> L2 = rand(LowerTriangularMatrix{ComplexF64}, 3, 3)
3×3 LowerTriangularMatrix{ComplexF64}:
 0.406483+0.885732im        0.0+0.0im            0.0+0.0im
  0.85852+0.0421888im  0.811017+0.694692im       0.0+0.0im
 0.785031+0.28899im    0.123636+0.516475im  0.482183+0.623109im

julia> L + L2
3×3 LowerTriangularMatrix{ComplexF64}:
 1.40648+0.885732im       0.0+0.0im           0.0+0.0im
 1.85852+0.0421888im  1.81102+0.694692im      0.0+0.0im
 1.78503+0.28899im    1.12364+0.516475im  1.48218+0.623109im
```
Great that the PR is merged! Just one thing: as we discussed here (or somewhere else?) before, broadcasting is in this case often slower than the direct arithmetic functions that we defined for […].

Yes, good call. I guess you really mean writing down the loops, not just propagating the broadcasting to `data`.
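For illustration, a hedged sketch of the two variants being contrasted here (the `LowerTriangularArray(data, m, n)` constructor and both function names are assumptions); the benchmark in the next comment compares exactly this kind of pair:

```julia
using SpeedyWeather

# (a) propagate the broadcast to the flat data and rewrap the result
mul_broadcast(L::LowerTriangularArray, s::Number) =
    LowerTriangularArray(L.data .* s, L.m, L.n)

# (b) write the loop down explicitly: allocate once, then fill without broadcast machinery
function mul_loop(L::LowerTriangularArray{T}, s::Number) where T
    data = similar(L.data, promote_type(T, typeof(s)))
    @inbounds for I in eachindex(L.data)
        data[I] = L.data[I] * s
    end
    return LowerTriangularArray(data, L.m, L.n)
end
```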
I mean as it was done before and removed in 20b0938. This is the old test I did (copied from our Zulip conversation) that used those functions:

```julia
using SpeedyWeather, BenchmarkTools

NF = Float32
L_cpu = randn(LowerTriangularArray{NF}, 100, 100, 10)

julia> @benchmark L_cpu .* 4.5f0
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  103.897 μs … 17.866 ms  ┊ GC (min … max):  0.00% … 98.21%
 Time  (median):     268.755 μs              ┊ GC (median):     0.00%
 Time  (mean ± σ):   305.518 μs ± 557.126 μs ┊ GC (mean ± σ):  11.72% ±  6.48%
 104 μs        Histogram: log(frequency) by time        451 μs <
 Memory estimate: 390.78 KiB, allocs estimate: 4.

julia> @benchmark L_cpu * 4.5f0
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  14.140 μs … 18.110 ms  ┊ GC (min … max):  0.00% … 99.06%
 Time  (median):     88.011 μs              ┊ GC (median):     0.00%
 Time  (mean ± σ):   112.584 μs ± 483.033 μs ┊ GC (mean ± σ):  19.44% ±  4.72%
 14.1 μs       Histogram: log(frequency) by time        182 μs <
 Memory estimate: 197.39 KiB, allocs estimate: 3.
```
Of course yes, they should be added in, I forgot that there's such a difference!
So, here's a first draft for the rework of the LowerTriangularMatrices to be GPU compatible and N-dimensional.

I implemented it to be agnostic of the actual array type and also avoided any form of custom kernels. All `for`-loops are gone and replaced with operations on the `data` object or some fancy indexing in some parts. In the current version I kept some of the old functions and restricted them, so that they are only used for the CPU `LowerTriangularMatrix`, because they might be faster (not checked).

I kept all old tests in order to make sure that everything still works exactly as before with the old syntax as well, as long as one uses `LowerTriangularMatrix` as a type. I added a lot of new tests, and in the end we might also merge a few of these tests to keep the test suite a bit cleaner. GPU tests are done with JLArrays.

There are still some things left to do:

- `dotview` or maybe also some additional `getindex`/`view`-like functions to reduce/avoid any form of allocations when the model calls individual layers (this will probably be done later)
- `copyto!` function (doesn't work with JLArrays, but should with Array and CuArray); see the sketch below
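For the last point, a hedged sketch of what a `copyto!` along these lines could look like (not the PR's implementation): forward to the flat `data` arrays, so it dispatches to whatever `copyto!` the backend's array type provides.

```julia
using SpeedyWeather

function Base.copyto!(dest::LowerTriangularArray, src::LowerTriangularArray)
    size(dest.data) == size(src.data) ||
        throw(DimensionMismatch("LowerTriangularArrays must have the same size"))
    copyto!(dest.data, src.data)   # Array, CuArray, etc. provide their own copyto!
    return dest
end
```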