
LowerTriangularMatrices: N-dimensional and GPU ready #503

Merged · 46 commits · Apr 25, 2024

Conversation

@maximilian-gelbrecht (Member) commented Mar 25, 2024

So, here’s a first draft for the rework of the LowerTriangularMatrices to be GPU compatible and N-dimensional.

I implemented it to be agnostic of the actual array type and also avoided any form of custom kernels. All for-loops are gone and replaced with operations on the data object or, in some parts, some fancy indexing. In the current version I kept some of the old functions and restricted them so that they are only used for the CPU LowerTriangularMatrix, because they might be faster (not checked).

I kept all old tests in order to make sure that everything still works exactly as before with the old syntax, as long as one uses LowerTriangularMatrix as a type. I added a lot of new tests, and in the end we might also merge a few of these tests to keep the test suite a bit cleaner. GPU tests are done with JLArrays.

There are still some things left to do:

  • N-dim broadcasting on CPU
  • broadcasting on GPU
  • more tests for GPU broadcasting
  • Docstring updates
  • Documentation updates
  • Implement additional functions that might be needed, e.g. dotview or maybe also some additional getindex/view-like functions to reduce/avoid any form of allocations when the model calls individual layers (this will probably be done later)
  • Most To-Do notes
  • Some minor To-Do notes left
  • test truncating copyto! function (doesn't work with JLArrays, but should with Array and CuArray)
  • Actually test it on GPU once, not just with JLArrays (not in CI, just manually)
  • Make sure all tests pass, not just those of the submodule
  • Benchmark
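
For reference, a minimal sketch of the kind of JLArray-based usage being tested; the constructor and adapt calls are taken from snippets later in this thread, everything else here is illustrative:

using SpeedyWeather, Adapt, JLArrays

L  = rand(LowerTriangularMatrix{Float32}, 5, 5)        # 2D special case
L3 = randn(LowerTriangularArray{Float32}, 5, 5, 10)    # stack of 10 lower triangular matrices

JL = adapt(JLArray, L)    # JLArrays mimic GPU arrays on the CPU, so CI can exercise the GPU code paths
JL + JL                   # should remain a LowerTriangularArray backed by a JLArray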

@milankl added the gpu 🖼️ (Everthing GPU related) and array types 🔢 (LowerTriangularMatrices and RingGrids) labels Mar 25, 2024
(Review threads on Project.toml and src/LowerTriangularMatrices/lower_triangular_matrix.jl — outdated, resolved.)
@milankl (Member) commented Mar 26, 2024

With this PR it seems that the roadmap is as follows:

  • Implement LowerTriangularArray so that the current LowerTriangularMatrix is the 2D special case of it (this PR)
  • Implement the ring grids like OctahedralGaussianArray so that the current OctahedralGaussianGrid is the 2D special case of it (another PR)
  • Replace the current layered structure of the prognostic and diagnostic variables with 3D/4D LowerTriangularArray and 2D/3D OctahedralGaussianArray (or other ring grids) and adapt the kernels accordingly (another PR)

I don't really see a way to split up the last point into several PRs, so I guess that'll just be a monster PR again, but that's probably fine.

@maximilian-gelbrecht (Member Author):

Yes, this sounds reasonable.

For the third point, in theory we could split it into multiple parts by defining temporary aliases for these structures that allow accessing the N-dimensional arrays like the current vectors of vectors.

It's also a question whether the third point should already include proper GPU/KernelAbstractions kernels or not.

@milankl (Member) commented Mar 26, 2024

It's also a question whether the third point should already include proper GPU/KernelAbstractions kernels or not.

I think I'd like to go step by step, simply because changing the indexing structure may introduce errors that we could spot more easily if we don't also do GPU kernels with KernelAbstractions in one go. However, one could already come up with a set of allowed operations / a recommended structure for how to write the kernels so that, while everything is still CPU-based and KernelAbstractions-independent, we have a kernel that is super close to what it's supposed to be.

So, for example, if you have a look at the kernels and say what we shouldn't do for the GPU, then I'd be happy to write the kernels with that in mind already.
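
As a concrete (hypothetical) example of what "GPU-friendly but still CPU-only" could look like: no scalar indexing into the LowerTriangularArray, only whole-array operations on the wrapped data, so the same structure later maps onto a KernelAbstractions kernel or a broadcast. scale! is one of the functions mentioned later in this thread; the body below is only a sketch, not the merged implementation.

# Sketch only: scale all stored (lower triangle) elements in place without ever
# touching a single index of the LowerTriangularArray itself.
function scale!(L::LowerTriangularArray, s::Number)
    L.data .*= s    # broadcast over the flat data array; GPU backends handle this natively
    return L
end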

@maximilian-gelbrecht (Member Author) commented Mar 28, 2024

The PR is coming along nicely, but the big bit that's missing is the broadcasting on GPU. I've had a quick look into it, and I am not quite certain how to proceed there right now. To recap:

Our LowerTriangularArray stores only the lower triangle of an array/matrix by flattening the first two dimensions. It wraps around an array data::AbstractArray{T,N-1}, but is a subtype of AbstractArray{T,N}. By defining custom getindex we can either index it with N-1 indices by directly indexing the data array, or with N indices that are converted appropriately, so that it can actually behave like an AbstractArray{T,N} in most situations.

For the broadcast that's a problem, as it wants to access the upper triangle that doesn't exist, and therefore quickly causes either BoundsErrors or incorrect results. On CPU I was able to fix this by defining my own broadcast style and extending the copyto!(dest::LowerTriangularArray, bc::Broadcasted) function, so that it correctly copies over only the lower triangle. On GPU, I looked into what's implemented over at GPUArrays.jl, but I am not quite certain whether we have to implement a copyto!/_copyto! with a new kernel here, or whether it's better to extend one of the other functions related to broadcasting and the broadcast style defined by GPUArrays.jl. Is it possible to change the axes of the object just for the broadcasting? There's the Broadcast.preprocess function. Maybe that's something that could work?

What we want is that any broadcasting is directly applied to all elements of the data::AbstractArray{T,N-1} object that LowerTriangularArray wraps around, while ideally LowerTriangularArray <: AbstractArray{T,N}.
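
To make the layout concrete, here is a stripped-down sketch of the data structure and index conversion being described; the field names m, n, data and the helper ij2k appear in the code snippets below, but the exact definition in the PR may differ:

# Sketch: only the lower triangle is stored, flattened column-major into the first
# dimension of `data`, so an N-dimensional LowerTriangularArray wraps an (N-1)-dimensional array.
struct LowerTriangularArray{T, N, ArrayType <: AbstractArray{T}} <: AbstractArray{T, N}
    data::ArrayType    # N-1 dimensional: flattened triangle × trailing dimensions
    m::Int             # number of rows of each lower triangular matrix
    n::Int             # number of columns of each lower triangular matrix
end

# convert a 2D index (i, j) with i >= j into the position k within the flattened triangle
ij2k(i::Integer, j::Integer, m::Integer) = i + (j - 1)*m - j*(j - 1) ÷ 2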

@milankl (Member) commented Mar 28, 2024

@vchuravy do you have some insights on how to define array broadcasting for the GPU? ☝🏼

@milankl mentioned this pull request Apr 1, 2024
@milankl (Member) commented Apr 17, 2024

I still get this with a 5×5×5 array:

julia> L[6,5,1]    # should throw
ERROR: BoundsError: attempt to access 5×5×5 LowerTriangularArray{Float64, 3, Matrix{Float64}} at index [6, 5]

julia> L[5,5,6]    # should also throw!
3.6338242925240716e174

I don't know why the tests passed for you?

EDIT: This is resolved below.

@milankl (Member) commented Apr 17, 2024

I'm replacing your

@inline function Base.getindex(L::LowerTriangularArray{T,N}, I::Vararg{Integer,N}) where {T,N}
    i, j = I[1:2]
    @boundscheck (0 < i <= L.m && 0 < j <= L.n) || throw(BoundsError(L, (i, j)))
    j > i && return zero(getindex(L.data, 1, I[3:end]...)) # to get a zero element in the correct shape, we just take the zero element of some valid element; there are probably faster ways to do this, but I don't know how (especially when ":" is used), and this is just a fallback anyway
    k = ij2k(i, j, L.m)
    @inbounds r = getindex(L.data, k, I[3:end]...)
    return r
end

with

@inline function Base.getindex(L::LowerTriangularArray{T,N}, I::Vararg{Integer,N}) where {T,N}
    i, j = I[1:2]
    @boundscheck (0 < i <= L.m && 0 < j <= L.n) || throw(BoundsError(L, (i, j)))
    @boundscheck j > i && return zero(T)
    k = ij2k(i, j, L.m)
    return getindex(L.data, k, I[3:end]...)
end
  • Because of the @inline, the @inbounds from the caller will propagate to the getindex(L.data, ...); with that @inbounds there is no check that the I indices are in bounds! (See the short illustration after this list.)
  • I've added @boundscheck around the j > i case, which is then compiled away when using @inbounds. Zero should be returned for [1,2] for example, but I think we can expect people to only access the lower triangle in an @inbounds case; that also saves one > check for performance.
  • Given the arguments here are all integers, the returned type should be T, e.g. Float32, and there is no need to request the data element at that index?
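
For readers unfamiliar with the mechanics, a small self-contained illustration (independent of LowerTriangularArray) of how @inbounds at the call site elides @boundscheck blocks in an @inline callee:

@inline function checked_get(v::Vector, i::Int)
    @boundscheck (1 <= i <= length(v)) || throw(BoundsError(v, i))
    return @inbounds v[i]
end

v = [1.0, 2.0, 3.0]
checked_get(v, 4)                    # throws BoundsError: the @boundscheck block runs
f(v, i) = @inbounds checked_get(v, i)
# f(v, 4)                            # @boundscheck compiled away here: out-of-bounds read, undefined behaviour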

@maximilian-gelbrecht (Member Author):

Yes, you are right! Thanks!

@milankl (Member) commented Apr 19, 2024

Okay, I think this is almost ready to go!! I still want to check that @inbounds is propagated correctly. Also, the module is still called LowerTriangularMatrices, but I think we should keep that name to highlight that it's for lower triangular matrices even if we stack them up in N-dimensional arrays, lest someone think the lower triangle becomes a quarter of a pyramid in 3D!

@milankl (Member) commented Apr 19, 2024

I still want to check that @inbounds is propagated correctly

No, they weren't; now all is corrected, I believe. It's a bit weird: I needed some Base.@propagate_inbounds (which also inlines) on getindex but not on setindex!. The tests always use [1,2], which is an invalid index, hence returning 0 for getindex or throwing a BoundsError for setindex!; but with @inbounds the ij2k function actually just returns [m,1], which is wrong but a valid index. So it's a neat way to test whether the @boundscheck is compiled away or not (actually reaching outside of the memory for tests isn't a good idea...).
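
For completeness, a hedged sketch of what the matching setindex! could look like given the behaviour described above (writing into the upper triangle throws, and the check is compiled away under @inbounds); this is not the exact merged code:

@inline function Base.setindex!(L::LowerTriangularArray{T, N}, x, I::Vararg{Integer, N}) where {T, N}
    i, j = I[1:2]
    # writing into the upper triangle is an error; elided with @inbounds at the call site
    @boundscheck j > i && throw(BoundsError(L, (i, j)))
    k = ij2k(i, j, L.m)
    setindex!(L.data, x, k, I[3:end]...)
end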

@milankl (Member) commented Apr 19, 2024

I also just learned that GitHub Actions always checks bounds regardless of @inbounds annotations, so I've commented those tests out, but they pass locally.

@navidcy (Collaborator) commented Apr 22, 2024

I also just learned that GitHub Actions always checks bounds regardless of @inbounds annotations, so I've commented those tests out, but they pass locally.

interesting!

@milankl (Member) commented Apr 22, 2024

I don't think this is our intention (with Array or JLArray):

julia> JL = adapt(JLArray, L)
5×5 LowerTriangularArray{Float64, 2, JLArray{Float64, 1}}:
 0.533073  0.0        0.0       0.0       0.0
 0.879798  0.921815   0.0       0.0       0.0
 0.313754  0.224968   0.229314  0.0       0.0
 0.811464  0.603529   0.325419  0.481392  0.0
 0.64585   0.0579084  0.509436  0.163703  0.540856

julia> JL + JL    # looks good
5×5 LowerTriangularArray{Float64, 2, JLArray{Float64, 1}}:
 1.06615   0.0       0.0       0.0       0.0
 1.7596    1.84363   0.0       0.0       0.0
 0.627507  0.449935  0.458628  0.0       0.0
 1.62293   1.20706   0.650837  0.962785  0.0
 1.2917    0.115817  1.01887   0.327406  1.08171

julia> JL .+ JL    # escapes and also where does the upper triangle come from?
5×5 JLArray{Float64, 2}:
 1.06615   1.2917    1.20706   0.458628  0.650837
 1.7596    1.84363   0.115817  0.650837  1.01887
 0.627507  0.449935  0.458628  1.01887   0.962785
 1.62293   1.20706   0.650837  0.962785  0.327406
 1.2917    0.115817  1.01887   0.327406  1.08171

I think the upper triangle comes from evaluating JL[i,j] for j > i with @inbounds, which maps (incorrectly, because the indices aren't in bounds) back to indices in the lower triangle.

The corresponding operation for ring grids is fine though!

@milankl (Member) commented Apr 23, 2024

I've removed the directly defined arithmetic for LowerTriangularMatrix as it was obscuring incorrectly working broadcasting. Now it's done via broadcasting and works largely, except for:

Similar to #520, I've defined two broadcasting styles, one for CPU (<: Broadcast.AbstractArrayStyle{N}) and one for GPU (<: GPUArrays.AbstractGPUArrayStyle{N}), which solves the problem from above. There is, however, still a problem with non-zero-preserving operations like .==

julia> L .== L
ERROR: BoundsError: attempt to access 5×5 LowerTriangularMatrix{Bool} at index [1, 2]

because this currently returns a LowerTriangularMatrix although it should be a BitMatrix (where a 1 can be written into the upper triangle). The corresponding operation with JLArrays works, though (because the indexing is done differently, only hitting the lower triangle):

julia> JL .== JL
5×5 LowerTriangularArray{Bool, 2, JLArray{Bool, 1}}:
 1  0  0  0  0
 1  1  0  0  0
 1  1  1  0  0
 1  1  1  1  0
 1  1  1  1  1

which is probably not what one would expect, but not too bad in the worst case. I've just changed all tests to use L == L or JL == JL instead, which works fine.
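
For orientation, a rough sketch of the two style definitions described above; the actual styles in the PR also carry the array type as a parameter (visible in the error message in the next comment), so the names and parameters here are illustrative only:

import GPUArrays

# CPU style: a plain AbstractArrayStyle of dimension N
struct LowerTriangularStyle{N} <: Base.Broadcast.AbstractArrayStyle{N} end

# GPU style: subtyping the GPUArrays style so the GPU-aware broadcasting machinery is reused
struct LowerTriangularGPUStyle{N} <: GPUArrays.AbstractGPUArrayStyle{N} end

# CPU-backed LowerTriangularArrays broadcast with the CPU style
Base.BroadcastStyle(::Type{<:LowerTriangularArray{T, N, AT}}) where {T, N, AT <: Array} =
    LowerTriangularStyle{N}()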

@milankl (Member) commented Apr 23, 2024

Next issue,

julia> L
5×5 LowerTriangularMatrix{Float64}:
 0.296704  0.0       0.0       0.0       0.0
 0.398559  0.926045  0.0       0.0       0.0
 0.958507  0.130742  0.300892  0.0       0.0
 0.867598  0.842258  0.588318  0.977474  0.0
 0.869255  0.989118  0.742295  0.500998  0.221757

julia> L2
5×5 LowerTriangularMatrix{ComplexF64}:
 0.750678+0.558921im        0.0+0.0im              0.0+0.0im           0.0+0.0im
 0.825657+0.14191im   0.0197592+0.542655im          0.0+0.0im           0.0+0.0im
 0.866794+0.687417im    0.55294+0.572049im          0.0+0.0im           0.0+0.0im
 0.219007+0.697757im   0.127504+0.590993im     0.454295+0.691951im      0.0+0.0im
 0.135136+0.433922im   0.668148+0.740522im     0.259046+0.285174im  0.21516+0.0390588im

julia> L2 .= L
ERROR: conflicting broadcast rules defined
  Broadcast.BroadcastStyle(::Main.SpeedyWeather.LowerTriangularMatrices.LowerTriangularStyle{2, Vector{ComplexF64}}, ::Main.SpeedyWeather.LowerTriangularMatrices.LowerTriangularStyle{2, Vector{Float64}}) = Main.SpeedyWeather.LowerTriangularMatrices.LowerTriangularStyle{2, Vector{ComplexF64}}()
  Broadcast.BroadcastStyle(::Main.SpeedyWeather.LowerTriangularMatrices.LowerTriangularStyle{2, Vector{Float64}}, ::Main.SpeedyWeather.LowerTriangularMatrices.LowerTriangularStyle{2, Vector{ComplexF64}}) = Main.SpeedyWeather.LowerTriangularMatrices.LowerTriangularStyle{2, Vector{Float64}}()
One of these should be undefined (and thus return Broadcast.Unknown).

Can probably be resolved by making sure the type T in the style isn't parametric...

@milankl (Member) commented Apr 24, 2024

Resolved now by making the ArrayType in the broadcasting style non-parametric:

julia> L = ones(LowerTriangularMatrix{Float32},3,3)
3×3 LowerTriangularMatrix{Float32}:
 1.0  0.0  0.0
 1.0  1.0  0.0
 1.0  1.0  1.0

julia> L2 = rand(LowerTriangularMatrix{ComplexF64}, 3, 3)
3×3 LowerTriangularMatrix{ComplexF64}:
 0.406483+0.885732im        0.0+0.0im            0.0+0.0im
  0.85852+0.0421888im  0.811017+0.694692im       0.0+0.0im
 0.785031+0.28899im    0.123636+0.516475im  0.482183+0.623109im

julia> L + L2
3×3 LowerTriangularMatrix{ComplexF64}:
 1.40648+0.885732im       0.0+0.0im           0.0+0.0im
 1.85852+0.0421888im  1.81102+0.694692im      0.0+0.0im
 1.78503+0.28899im    1.12364+0.516475im  1.48218+0.623109im

@milankl merged commit 00a2e27 into main Apr 25, 2024 · 4 checks passed
@milankl deleted the mg/ltm-gpu-ndim branch April 25, 2024 21:37
@maximilian-gelbrecht (Member Author):

Great that the PR is merged!

Just one thing: as we discussed here (or somewhere else?) before, broadcasting is in this case often slower than the direct arithmetic functions that we had defined for +, *, scale!, etc. If we intend to use this in the dynamical core, it might be worth adding those functions again.

@milankl (Member) commented May 2, 2024

Yes, good call. I guess you really mean writing down the loops, not just propagating the broadcasting to .data?

@maximilian-gelbrecht (Member Author) commented May 2, 2024

I mean as it was done before and removed in 20b0938.

This is the old test I did (copied from our Zulip conversation) that used those functions:

using SpeedyWeather, BenchmarkTools
NF = Float32
L_cpu = randn(LowerTriangularArray{NF}, 100, 100, 10)

julia> @benchmark L_cpu .* 4.5f0
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  103.897 μs … 17.866 ms  ┊ GC (min … max):  0.00% … 98.21%
 Time  (median):     268.755 μs               ┊ GC (median):     0.00%
 Time  (mean ± σ):   305.518 μs ± 557.126 μs  ┊ GC (mean ± σ):  11.72% ±  6.48%

  ▆                        ▃▆███▇▅▄▃▃▂▂▃▂▂▂▂▃▃▂▂▂▁▁ ▁           ▃
  ███▇▇▇▆▃▆▅▆▅▅▃▄▅▄▅▃▅▃▄▄▁▅██████████████████████████████▇▆▇▅▄▅ █
  104 μs        Histogram: log(frequency) by time        451 μs <

 Memory estimate: 390.78 KiB, allocs estimate: 4.

julia> @benchmark L_cpu * 4.5f0
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):   14.140 μs … 18.110 ms  ┊ GC (min … max):  0.00% … 99.06%
 Time  (median):      88.011 μs               ┊ GC (median):     0.00%
 Time  (mean ± σ):   112.584 μs ± 483.033 μs  ┊ GC (mean ± σ):  19.44% ±  4.72%

  ▄▃                     ▄▇██▇▆▆▆▅▄▃▂▁▁ ▁▁▁  ▁ ▁                ▃
  ██▆▄▆▆▁▄▅▅▃▁▁▃▄▄▁▃▁▁▁▄▅████████████████████████▇▇▇▇██▇█▇█▇▇▇▆ █
  14.1 μs       Histogram: log(frequency) by time        182 μs <

 Memory estimate: 197.39 KiB, allocs estimate: 3.

@milankl (Member) commented May 2, 2024

Of course, yes, they should be added back in; I forgot that there's such a difference!
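
A hedged sketch of what re-adding such a direct method could look like; the removed versions may have used explicit loops instead, and the constructor call here is assumed, so this is illustrative only:

# Sketch: scalar multiply acting directly on the flat data array, skipping the
# N-dimensional broadcasting machinery (and its overhead) entirely.
function Base.:*(L::LowerTriangularArray, s::Number)
    LowerTriangularArray(L.data * s, L.m, L.n)   # assumed constructor signature
end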
