
LowerTriangularMatrices: N-dimensional and GPU ready #503

Merged · 46 commits · Apr 25, 2024

Conversation

@maximilian-gelbrecht (Member) commented Mar 25, 2024

So, here’s a first draft for the rework of the LowerTriangularMatrices to be GPU compatible and N-dimensional.

I implemented it to be agnostic of the actual array type and also avoided any form of custom kernels. All for-loops are gone and replaced with operations on the data object or, in some parts, some fancy indexing. In the current version I kept some of the old functions and restricted them so that they are only used for the CPU LowerTriangularMatrix, because they might be faster (not checked).

I kept all old tests in order to make sure that everything still works exactly as before with the old syntax, as long as one uses LowerTriangularMatrix as a type. I added a lot of new tests, and in the end we might also merge a few of these tests to keep the test suite a bit cleaner. GPU tests are done with JLArrays.

There are still some things left to do:

  • N-dim broadcasting on CPU
  • broadcasting on GPU
  • more tests for GPU broadcasting
  • Docstring updates
  • Documentation updates
  • Implement additional functions that might be needed, e.g. dotview or maybe also some additional getindex/view-like functions to reduce/avoid any form of allocations when the model calls individual layers (this will probably be done later)
  • Most To-Do notes
  • Some minor To-Do notes left
  • test truncating copyto! function (doesn't work with JLArrays, but should with Array and CuArray)
  • Actually test it on GPU once, not just with JLArrays (not in CI, just manually)
  • Make sure all tests pass, not just those of the submodule
  • Benchmark
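
For reference, a minimal sketch of the kind of JLArray-based usage being tested; the constructor and adapt calls are taken from snippets later in this thread, everything else here is illustrative:

using SpeedyWeather, Adapt, JLArrays

L  = rand(LowerTriangularMatrix{Float32}, 5, 5)        # 2D special case
L3 = randn(LowerTriangularArray{Float32}, 5, 5, 10)    # stack of 10 lower triangular matrices

JL = adapt(JLArray, L)    # JLArrays mimic GPU arrays on the CPU, so CI can exercise the GPU code paths
JL + JL                   # should remain a LowerTriangularArray backed by a JLArray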

@milankl added the gpu 🖼️ (Everthing GPU related) and array types 🔢 (LowerTriangularMatrices and RingGrids) labels Mar 25, 2024
(Review threads on Project.toml and src/LowerTriangularMatrices/lower_triangular_matrix.jl — outdated, resolved.)
@milankl (Member) commented Mar 26, 2024

With this PR it seems that the roadmap is as follows:

  • Implement LowerTriangularArray so that the current LowerTriangularMatrix is the 2D special case of it (this PR)
  • Implement the ring grids like OctahedralGaussianArray so that the current OctahedralGaussianGrid is the 2D special case of it (another PR)
  • Replace the current layered structure of the prognostic and diagnostic variables with 3D/4D LowerTriangularArray and 2D/3D OctahedralGaussianArray (or other ring grids) and adapt the kernels accordingly (another PR)

I don't really see a way to split up the last point into several PRs, so I guess that'll just be a monster PR again, but that's probably fine.

@maximilian-gelbrecht (Member Author):

Yes, this sounds reasonable.

For the third point, in theory we could split it into multiple parts by defining temporary aliases for these structures that allow accessing the N-dimensional arrays like the current vectors of vectors.

It's also a question whether the third point should already include proper GPU/KernelAbstractions kernels or not.

@milankl (Member) commented Mar 26, 2024

It's also a question whether the third point should already include proper GPU/KernelAbstractions kernels or not.

I think I'd like to go step by step, simply because changing the indexing structure may introduce errors that we could spot more easily if we don't also do GPU kernels with KernelAbstractions in one go. However, one could already come up with a set of allowed operations / a recommended structure for how to write the kernels so that, while everything is still CPU-based and KernelAbstractions-independent, we have a kernel that is super close to what it's supposed to be.

So, for example, if you have a look at the kernels and say what we shouldn't do for the GPU, then I'd be happy to write the kernels with that in mind already.
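
As a concrete (hypothetical) example of what "GPU-friendly but still CPU-only" could look like: no scalar indexing into the LowerTriangularArray, only whole-array operations on the wrapped data, so the same structure later maps onto a KernelAbstractions kernel or a broadcast. scale! is one of the functions mentioned later in this thread; the body below is only a sketch, not the merged implementation.

# Sketch only: scale all stored (lower triangle) elements in place without ever
# touching a single index of the LowerTriangularArray itself.
function scale!(L::LowerTriangularArray, s::Number)
    L.data .*= s    # broadcast over the flat data array; GPU backends handle this natively
    return L
end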

@maximilian-gelbrecht (Member Author) commented Mar 28, 2024

The PR is coming along nicely, but the big bit that's missing is the broadcasting on GPU. I've had a quick look into it, and I am not quite certain how to proceed there right now. To recap:

Our LowerTriangularArray stores only the lower triangle of an array/matrix by flattening the first two dimensions. It wraps around an array data::AbstractArray{T,N-1}, but is a subtype of AbstractArray{T,N}. By defining custom getindex we can either index it with N-1 indices by directly indexing the data array, or with N indices that are converted appropriately, so that it can actually behave like an AbstractArray{T,N} in most situations.

For the broadcast that's a problem, as it wants to access the upper triangle that doesn't exist, and therefore quickly causes either BoundsErrors or incorrect results. On CPU I was able to fix this by defining my own broadcast style and extending the copyto!(dest::LowerTriangularArray, bc::Broadcasted) function, so that it correctly copies over only the lower triangle. On GPU, I looked into what's implemented over at GPUArrays.jl, but I am not quite certain whether we have to implement a copyto!/_copyto! with a new kernel here, or whether it's better to extend one of the other functions related to broadcasting and the broadcast style defined by GPUArrays.jl. Is it possible to change the axes of the object just for the broadcasting? There's the Broadcast.preprocess function. Maybe that's something that could work?

What we want is that any broadcasting is directly applied to all elements of the data::AbstractArray{T,N-1} object that LowerTriangularArray wraps around, while ideally LowerTriangularArray <: AbstractArray{T,N}.
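
To make the layout concrete, here is a stripped-down sketch of the data structure and index conversion being described; the field names m, n, data and the helper ij2k appear in the code snippets below, but the exact definition in the PR may differ:

# Sketch: only the lower triangle is stored, flattened column-major into the first
# dimension of `data`, so an N-dimensional LowerTriangularArray wraps an (N-1)-dimensional array.
struct LowerTriangularArray{T, N, ArrayType <: AbstractArray{T}} <: AbstractArray{T, N}
    data::ArrayType    # N-1 dimensional: flattened triangle × trailing dimensions
    m::Int             # number of rows of each lower triangular matrix
    n::Int             # number of columns of each lower triangular matrix
end

# convert a 2D index (i, j) with i >= j into the position k within the flattened triangle
ij2k(i::Integer, j::Integer, m::Integer) = i + (j - 1)*m - j*(j - 1) ÷ 2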

@milankl (Member) commented Mar 28, 2024

@vchuravy do you have some insights on how to define array broadcasting for the GPU? ☝🏼

@milankl mentioned this pull request Apr 1, 2024
@milankl (Member) commented Apr 17, 2024

I still get this with a 5×5×5 array:

julia> L[6,5,1]    # should throw
ERROR: BoundsError: attempt to access 5×5×5 LowerTriangularArray{Float64, 3, Matrix{Float64}} at index [6, 5]

julia> L[5,5,6]    # should also throw!
3.6338242925240716e174

I don't know why the tests passed for you?

EDIT: This is resolved below.

@milankl (Member) commented Apr 17, 2024

I'm replacing your

@inline function Base.getindex(L::LowerTriangularArray{T,N}, I::Vararg{Integer,N}) where {T,N}
    i, j = I[1:2]
    @boundscheck (0 < i <= L.m && 0 < j <= L.n) || throw(BoundsError(L, (i, j)))
    j > i && return zero(getindex(L.data, 1, I[3:end]...)) # to get a zero element in the correct shape, we just take the zero element of some valid element; there are probably faster ways to do this, but I don't know how (especially when ":" is used), and this is just a fallback anyway
    k = ij2k(i, j, L.m)
    @inbounds r = getindex(L.data, k, I[3:end]...)
    return r
end

with

@inline function Base.getindex(L::LowerTriangularArray{T,N}, I::Vararg{Integer,N}) where {T,N}
    i, j = I[1:2]
    @boundscheck (0 < i <= L.m && 0 < j <= L.n) || throw(BoundsError(L, (i, j)))
    @boundscheck j > i && return zero(T)
    k = ij2k(i, j, L.m)
    return getindex(L.data, k, I[3:end]...)
end
  • Because of the @inline, the @inbounds from the caller will propagate to the getindex(L.data, ...); with that @inbounds there is no check that the I indices are in bounds! (See the short illustration after this list.)
  • I've added @boundscheck around the j > i case, which is then compiled away when using @inbounds. Zero should be returned for [1,2] for example, but I think we can expect people to only access the lower triangle in an @inbounds case; that also saves one > check for performance.
  • Given the arguments here are all integers, the returned type should be T, e.g. Float32, and there is no need to request the data element at that index?
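
For readers unfamiliar with the mechanics, a small self-contained illustration (independent of LowerTriangularArray) of how @inbounds at the call site elides @boundscheck blocks in an @inline callee:

@inline function checked_get(v::Vector, i::Int)
    @boundscheck (1 <= i <= length(v)) || throw(BoundsError(v, i))
    return @inbounds v[i]
end

v = [1.0, 2.0, 3.0]
checked_get(v, 4)                    # throws BoundsError: the @boundscheck block runs
f(v, i) = @inbounds checked_get(v, i)
# f(v, 4)                            # @boundscheck compiled away here: out-of-bounds read, undefined behaviour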

@maximilian-gelbrecht (Member Author):

Yes, you are right! Thanks!

@milankl (Member) commented Apr 19, 2024

Okay, I think this is almost ready to go!! I still want to check that @inbounds is propagated correctly. Also, the module is still called LowerTriangularMatrices, but I think we should keep that name to highlight that it's for lower triangular matrices even if we stack them up in N-dimensional arrays, lest someone think the lower triangle becomes a quarter of a pyramid in 3D!

@milankl (Member) commented Apr 19, 2024

I still want to check that @inbounds is propagated correctly

No, they weren't; now all is corrected, I believe. It's a bit weird: I needed some Base.@propagate_inbounds (which also inlines) on getindex but not on setindex!. The tests always use [1,2], which is an invalid index, hence returning 0 for getindex or throwing a BoundsError for setindex!; but with @inbounds the ij2k function actually just returns [m,1], which is wrong but a valid index. So it's a neat way to test whether the @boundscheck is compiled away or not (actually reaching outside of the memory for tests isn't a good idea...).
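
For completeness, a hedged sketch of what the matching setindex! could look like given the behaviour described above (writing into the upper triangle throws, and the check is compiled away under @inbounds); this is not the exact merged code:

@inline function Base.setindex!(L::LowerTriangularArray{T, N}, x, I::Vararg{Integer, N}) where {T, N}
    i, j = I[1:2]
    # writing into the upper triangle is an error; elided with @inbounds at the call site
    @boundscheck j > i && throw(BoundsError(L, (i, j)))
    k = ij2k(i, j, L.m)
    setindex!(L.data, x, k, I[3:end]...)
end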

@milankl (Member) commented Apr 19, 2024

I also just learned that GitHub Actions always checks bounds regardless of @inbounds annotations, so I've commented those tests out, but they pass locally.

@navidcy (Collaborator) commented Apr 22, 2024

I also just learned that GitHub Actions always checks bounds regardless of @inbounds annotations, so I've commented those tests out, but they pass locally.

interesting!

@milankl (Member) commented Apr 22, 2024

I don't think this is our intention (with Array or JLArray):

julia> JL = adapt(JLArray, L)
5×5 LowerTriangularArray{Float64, 2, JLArray{Float64, 1}}:
 0.533073  0.0        0.0       0.0       0.0
 0.879798  0.921815   0.0       0.0       0.0
 0.313754  0.224968   0.229314  0.0       0.0
 0.811464  0.603529   0.325419  0.481392  0.0
 0.64585   0.0579084  0.509436  0.163703  0.540856

julia> JL + JL    # looks good
5×5 LowerTriangularArray{Float64, 2, JLArray{Float64, 1}}:
 1.06615   0.0       0.0       0.0       0.0
 1.7596    1.84363   0.0       0.0       0.0
 0.627507  0.449935  0.458628  0.0       0.0
 1.62293   1.20706   0.650837  0.962785  0.0
 1.2917    0.115817  1.01887   0.327406  1.08171

julia> JL .+ JL    # escapes and also where does the upper triangle come from?
5×5 JLArray{Float64, 2}:
 1.06615   1.2917    1.20706   0.458628  0.650837
 1.7596    1.84363   0.115817  0.650837  1.01887
 0.627507  0.449935  0.458628  1.01887   0.962785
 1.62293   1.20706   0.650837  0.962785  0.327406
 1.2917    0.115817  1.01887   0.327406  1.08171

I think the upper triangle comes from evaluating JL[i,j] for j > i with @inbounds, which maps (incorrectly, because the indices aren't in bounds) back to indices in the lower triangle.

The corresponding operation for ring grids is fine though!

@milankl (Member) commented Apr 23, 2024

I've removed the directly defined arithmetic for LowerTriangularMatrix as it was obscuring incorrectly working broadcasting. Now it's done via broadcasting and works largely, except for:

Similar to #520, I've defined two broadcasting styles, one for CPU (<: Broadcast.AbstractArrayStyle{N}) and one for GPU (<: GPUArrays.AbstractGPUArrayStyle{N}), which solves the problem from above. There is, however, still a problem with non-zero-preserving operations like .==

julia> L .== L
ERROR: BoundsError: attempt to access 5×5 LowerTriangularMatrix{Bool} at index [1, 2]

because this currently returns a LowerTriangularMatrix although it should be a BitMatrix (where a 1 can be written into the upper triangle). The corresponding operation with JLArrays works, though (because the indexing is done differently, only hitting the lower triangle):

julia> JL .== JL
5×5 LowerTriangularArray{Bool, 2, JLArray{Bool, 1}}:
 1  0  0  0  0
 1  1  0  0  0
 1  1  1  0  0
 1  1  1  1  0
 1  1  1  1  1

which is probably not what one would expect, but not too bad in the worst case. I've just changed all tests to use L == L or JL == JL instead, which works fine.
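
For orientation, a rough sketch of the two style definitions described above; the actual styles in the PR also carry the array type as a parameter (visible in the error message in the next comment), so the names and parameters here are illustrative only:

import GPUArrays

# CPU style: a plain AbstractArrayStyle of dimension N
struct LowerTriangularStyle{N} <: Base.Broadcast.AbstractArrayStyle{N} end

# GPU style: subtyping the GPUArrays style so the GPU-aware broadcasting machinery is reused
struct LowerTriangularGPUStyle{N} <: GPUArrays.AbstractGPUArrayStyle{N} end

# CPU-backed LowerTriangularArrays broadcast with the CPU style
Base.BroadcastStyle(::Type{<:LowerTriangularArray{T, N, AT}}) where {T, N, AT <: Array} =
    LowerTriangularStyle{N}()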

@milankl (Member) commented Apr 23, 2024

Next issue,

julia> L
5×5 LowerTriangularMatrix{Float64}:
 0.296704  0.0       0.0       0.0       0.0
 0.398559  0.926045  0.0       0.0       0.0
 0.958507  0.130742  0.300892  0.0       0.0
 0.867598  0.842258  0.588318  0.977474  0.0
 0.869255  0.989118  0.742295  0.500998  0.221757

julia> L2
5×5 LowerTriangularMatrix{ComplexF64}:
 0.750678+0.558921im        0.0+0.0im              0.0+0.0im           0.0+0.0im
 0.825657+0.14191im   0.0197592+0.542655im          0.0+0.0im           0.0+0.0im
 0.866794+0.687417im    0.55294+0.572049im          0.0+0.0im           0.0+0.0im
 0.219007+0.697757im   0.127504+0.590993im     0.454295+0.691951im      0.0+0.0im
 0.135136+0.433922im   0.668148+0.740522im     0.259046+0.285174im  0.21516+0.0390588im

julia> L2 .= L
ERROR: conflicting broadcast rules defined
  Broadcast.BroadcastStyle(::Main.SpeedyWeather.LowerTriangularMatrices.LowerTriangularStyle{2, Vector{ComplexF64}}, ::Main.SpeedyWeather.LowerTriangularMatrices.LowerTriangularStyle{2, Vector{Float64}}) = Main.SpeedyWeather.LowerTriangularMatrices.LowerTriangularStyle{2, Vector{ComplexF64}}()
  Broadcast.BroadcastStyle(::Main.SpeedyWeather.LowerTriangularMatrices.LowerTriangularStyle{2, Vector{Float64}}, ::Main.SpeedyWeather.LowerTriangularMatrices.LowerTriangularStyle{2, Vector{ComplexF64}}) = Main.SpeedyWeather.LowerTriangularMatrices.LowerTriangularStyle{2, Vector{Float64}}()
One of these should be undefined (and thus return Broadcast.Unknown).

Can probably be resolved by making sure the type T in the style isn't parametric...

@milankl (Member) commented Apr 24, 2024

Resolved now by making the ArrayType in the broadcasting style non-parametric:

julia> L = ones(LowerTriangularMatrix{Float32},3,3)
3×3 LowerTriangularMatrix{Float32}:
 1.0  0.0  0.0
 1.0  1.0  0.0
 1.0  1.0  1.0

julia> L2 = rand(LowerTriangularMatrix{ComplexF64}, 3, 3)
3×3 LowerTriangularMatrix{ComplexF64}:
 0.406483+0.885732im        0.0+0.0im            0.0+0.0im
  0.85852+0.0421888im  0.811017+0.694692im       0.0+0.0im
 0.785031+0.28899im    0.123636+0.516475im  0.482183+0.623109im

julia> L + L2
3×3 LowerTriangularMatrix{ComplexF64}:
 1.40648+0.885732im       0.0+0.0im           0.0+0.0im
 1.85852+0.0421888im  1.81102+0.694692im      0.0+0.0im
 1.78503+0.28899im    1.12364+0.516475im  1.48218+0.623109im

@milankl merged commit 00a2e27 into main Apr 25, 2024 · 4 checks passed
@milankl deleted the mg/ltm-gpu-ndim branch April 25, 2024 21:37
@maximilian-gelbrecht (Member Author):

Great that the PR is merged!

Just one thing: as we discussed here (or somewhere else?) before, broadcasting is in this case often slower than the direct arithmetic functions that we had defined for +, *, scale!, etc. If we intend to use this in the dynamical core, it might be worth adding those functions again.

@milankl (Member) commented May 2, 2024

Yes, good call. I guess you really mean writing down the loops, not just propagating the broadcasting to .data?

@maximilian-gelbrecht (Member Author) commented May 2, 2024

I mean as it was done before and removed in 20b0938.

This is the old test I did (copied from our Zulip conversation) that used those functions:

using SpeedyWeather, BenchmarkTools
NF = Float32
L_cpu = randn(LowerTriangularArray{NF}, 100, 100, 10)

julia> @benchmark L_cpu .* 4.5f0
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  103.897 μs … 17.866 ms  ┊ GC (min … max):  0.00% … 98.21%
 Time  (median):     268.755 μs               ┊ GC (median):     0.00%
 Time  (mean ± σ):   305.518 μs ± 557.126 μs  ┊ GC (mean ± σ):  11.72% ±  6.48%

  ▆                        ▃▆███▇▅▄▃▃▂▂▃▂▂▂▂▃▃▂▂▂▁▁ ▁           ▃
  ███▇▇▇▆▃▆▅▆▅▅▃▄▅▄▅▃▅▃▄▄▁▅██████████████████████████████▇▆▇▅▄▅ █
  104 μs        Histogram: log(frequency) by time        451 μs <

 Memory estimate: 390.78 KiB, allocs estimate: 4.

julia> @benchmark L_cpu * 4.5f0
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):   14.140 μs … 18.110 ms  ┊ GC (min … max):  0.00% … 99.06%
 Time  (median):      88.011 μs               ┊ GC (median):     0.00%
 Time  (mean ± σ):   112.584 μs ± 483.033 μs  ┊ GC (mean ± σ):  19.44% ±  4.72%

  ▄▃                     ▄▇██▇▆▆▆▅▄▃▂▁▁ ▁▁▁  ▁ ▁                ▃
  ██▆▄▆▆▁▄▅▅▃▁▁▃▄▄▁▃▁▁▁▄▅████████████████████████▇▇▇▇██▇█▇█▇▇▇▆ █
  14.1 μs       Histogram: log(frequency) by time        182 μs <

 Memory estimate: 197.39 KiB, allocs estimate: 3.

@milankl (Member) commented May 2, 2024

Of course, yes, they should be added back in; I forgot that there's such a difference!
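
A hedged sketch of what re-adding such a direct method could look like; the removed versions may have used explicit loops instead, and the constructor call here is assumed, so this is illustrative only:

# Sketch: scalar multiply acting directly on the flat data array, skipping the
# N-dimensional broadcasting machinery (and its overhead) entirely.
function Base.:*(L::LowerTriangularArray, s::Number)
    LowerTriangularArray(L.data * s, L.m, L.n)   # assumed constructor signature
end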
