Feature request: more KnetArray indexing #198
And these are needed to reorder hidden states between runs in RNNs with dynamic sizes:

```julia
getindex(::Knet.KnetArray{Float32,3}, ::Colon, ::Array{Int64,1}, ::Colon)
getindex(::Knet.KnetArray{Float32,3}, ::Colon, ::UnitRange{Int64}, ::Colon)
getindex(::Knet.KnetArray{Float32,3}, ::Colon, ::StepRange{Int64,Int64}, ::Colon)
```
@maleadt I wonder if we can somehow use the kernels generated by CUDAnative instead of trying to implement all these indexing variations by hand. I would generate code for these using CUDAnative and put it in libknet8.so for now, I am just not sure how. This would be a stopgap measure until CUDAnative no longer requires a Julia recompile and I can use CuArrays etc.
Yeah, definitely. I'm not sure what the kernels would look like, but you could use metaprogramming to generate the function definitions. If you can come up with a simple example I could look into creating a proof of concept.
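For illustration, here is a minimal sketch of what that metaprogramming could look like. Everything here is hypothetical: `unsafe_getindex3d` stands in for a kernel that would be generated ahead of time with CUDAnative and shipped in libknet8; it does not exist in Knet.

```julia
# Hypothetical sketch: generate one getindex method per index type instead
# of hand-writing every variant. unsafe_getindex3d is an assumed helper
# wrapping a pre-compiled kernel, not an existing Knet function.
for J in (Array{Int64,1}, UnitRange{Int64}, StepRange{Int64,Int64})
    @eval Base.getindex(a::KnetArray{Float32,3}, ::Colon, j::$J, ::Colon) =
        unsafe_getindex3d(a, collect(Int64, j))
end
```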
OK, first I wanted to make sure that we can cover the indexing operations with CUDAnative using CuArrays. I got a segfault in Julia release-0.6. Tagging @maleadt, @MikeInnes, @SimonDanisch.
I got a CUDA error instead of a segfault: https://gist.github.com/ilkerkesen/c921d41894b8667d8acc6e71803f04fe. I am using Julia v0.6, and the full path is:
CLArrays seems to work fine, so I guess this is CUDAnative-specific (CuArrays and CLArrays should share the same indexing code, if I'm not missing something).
Ah no, it's just almost the same code... Let me check the specific differences.
I'm running out of time and couldn't get to the bottom of it. Just comment out the indexing include in CuArrays.jl at https://github.com/JuliaGPU/CuArrays.jl/blob/master/src/CuArrays.jl#L13 and try it again. Then it should use the same code as CLArrays for indexing!
Tagged versions / using Pkg should work; respecting REQUIRE is ... required.
Hm, maybe I messed up somewhere, but all I did was Pkg.update, and it told me I need 0.7 for LLVM.jl. Then I started manually downgrading relevant packages, and since each package always took a long time, I ended up in the situation mentioned above ;)
Huh, 0.5.1 is the last version of LLVM.jl that supports 0.6; 0.9.0 does indeed require 0.7, but that shouldn't get auto-installed, right?

EDIT: this is getting a bit off topic, we shouldn't be hijacking this issue I guess. @SimonDanisch, could you open an issue on LLVM.jl if Pkg effectively updated to an incompatible version?
Weird! Yeah, not sure what's going on ;)
@SimonDanisch how can I repeat your test with CLArrays? I guess I am confused about the relationship between GPUArrays, CLArrays, CuArrays, etc.
I can confirm that, as @SimonDanisch suggested, commenting out https://github.com/JuliaGPU/CuArrays.jl/blob/master/src/CuArrays.jl#L13 (i.e. the indexing include) works.
Well, both inherit from GPUArrays, which also contains hardware-independent kernels (e.g. for broadcasting and indexing), and in theory CuArrays and CLArrays should share all those kernels and interfaces. Long story short: @MikeInnes wanted to take things slowly, so almost all code in CuArrays was kept for now, overwriting the hardware-independent kernels from GPUArrays. The replacement with the GPUArrays kernels will come piece by piece, but for every kernel in CuArrays there should already be a hardware-independent version in GPUArrays (which should offer the same functionality at the same execution speed). So CLArrays itself doesn't contain a single kernel and uses the GPUArrays hardware-independent kernels exclusively, which are pretty much a mixture of the best kernels from CuArrays (only rewritten to be hardware independent) and the old GPUArrays.

Regarding using CLArrays:

```julia
using CLArrays

a = rand(4,4,4)
@show summary(a)
@show summary(a[:,2,2])
@show summary(a[:,[2,4],2])
@show summary(a[:,[2,4],[1,3]])
@show summary(a[:,[2,4],:])
@show summary(a[:,2:4,:])
@show summary(a[:,1:2:4,:])

a = CLArray(a)
@show summary(a)
@show summary(a[:,2,2])
@show summary(a[:,[2,4],2])
@show summary(a[:,[2,4],[1,3]])
@show summary(a[:,[2,4],:])
@show summary(a[:,2:4,:])
@show summary(a[:,1:2:4,:])
```
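Presumably the same check can be run against CuArrays by only swapping the constructor (a sketch, assuming `CuArray(a)` uploads a host array the same way `CLArray(a)` does):

```julia
using CuArrays

a = CuArray(rand(4,4,4))  # assumption: same upload pattern as CLArray(a)
@show summary(a[:,[2,4],:])
@show summary(a[:,2:4,:])
@show summary(a[:,1:2:4,:])
```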
@SimonDanisch, @maleadt how can I see the PTX code behind these indexing operations with CLArrays or CuArrays? Is there a separate kernel compiled for each distinct getindex argument signature? How does the same PTX handle different index ranges or different index arrays?
You can use CUDAnative's reflection functions (e.g. code_ptx). EDIT: of course, you need to know the exact kernel signature for that.
I don't really have a nice API for this, but you can quite easily access any compiled kernel like this:

```julia
julia> a[:,2,2] # compile an indexing operation
GPU: 4-element Array{Float64,1}:
 0.337476
 0.673492
 0.28916
 0.260523

julia> CLArrays.compiled_functions # the compiled kernel cache
Dict{Any,Any} with 1 entry:
  (Ptr{Void} @0x00000000036b3760, GPUArrays.index_kernel, (CLArrays… => CLArrays.CLFunction{GPUArrays.#index_kernel,Tuple{CLArrays.KernelState,CLArrays.CLArray{Float64,1},CLArrays.CLArray{Float64,3},Tupl…

julia> signature = first(CLArrays.compiled_functions)[1][2:end] # the signature of a compiled kernel
(GPUArrays.index_kernel, (CLArrays.KernelState, CLArrays.DeviceArray{Float64,1,Transpiler.CLIntrinsics.GlobalPointer{Float64}}, CLArrays.DeviceArray{Float64,3,Transpiler.CLIntrinsics.GlobalPointer{Float64}}, Tuple{UInt32,UInt32,UInt32}, Tuple{Base.Slice{Base.OneTo{Int64}},Int64,Int64}))

julia> m = Transpiler.CLMethod(signature)
GPUArrays.index_kernel(CLArrays.KernelState, CLArrays.DeviceArray{Float64,1,Transpiler.CLIntrinsics.GlobalPointer{Float64}}, CLArrays.DeviceArray{Float64,3,Transpiler.CLIntrinsics.GlobalPointer{Float64}}, Tuple{UInt32,UInt32,UInt32}, Tuple{Base.Slice{Base.OneTo{Int64}},Int64,Int64})

julia> Sugar.getast!(m) # compiled julia expr
quote
    $(Expr(:inbounds, true))
    I_3::Int64::Int64
    I_2::Int64::Int64
    I_1::Int64::Int64
    is::Tuple{UInt32,UInt32,UInt32}::Tuple{UInt32,UInt32,UInt32}
    i::UInt32::UInt32
    _7::UInt32 = (GPUArrays.linear_index)(_2::CLArrays.KernelState)::UInt32
    if (>)(_7::UInt32, (length)(_3::CLArrays.DeviceArray{Float64,1,Transpiler.CLIntrinsics.GlobalPointer{Float64}})::UInt32)::Bool
        $(Expr(:return))
    end
    _8::Tuple{UInt32,UInt32,UInt32} = (GPUArrays.gpu_ind2sub)(_5::Tuple{UInt32,UInt32,UInt32}, _7::UInt32)::Tuple{UInt32,UInt32,UInt32}
    _9::Int64 = (getindex)((getfield)(_6::Tuple{Base.Slice{Base.OneTo{Int64}},Int64,Int64}, field1)::Base.Slice{Base.OneTo{Int64}}, (Int64)((getfield)(_8::Tuple{UInt32,UInt32,UInt32}, s0)::UInt32)::Int64)::Int64
    _10::Int64 = (getindex)((getfield)(_6::Tuple{Base.Slice{Base.OneTo{Int64}},Int64,Int64}, field2)::Int64, (Int64)((getfield)(_8::Tuple{UInt32,UInt32,UInt32}, s1)::UInt32)::Int64)::Int64
    _11::Int64 = (getindex)((getfield)(_6::Tuple{Base.Slice{Base.OneTo{Int64}},Int64,Int64}, field3)::Int64, (Int64)((getfield)(_8::Tuple{UInt32,UInt32,UInt32}, s2)::UInt32)::Int64)::Int64
    _12::Float64::Float64::Float64
    _12::Float64 = (getindex)(_4::CLArrays.DeviceArray{Float64,3,Transpiler.CLIntrinsics.GlobalPointer{Float64}}, (Tuple{Int64,Int64,Int64}){_9::Int64, _10::Int64, _11::Int64}::Tuple{Int64,Int64,Int64})::Float64
    (setindex!)(_3::CLArrays.DeviceArray{Float64,1,Transpiler.CLIntrinsics.GlobalPointer{Float64}}, _12::Float64, (Tuple{UInt32}){_7::UInt32}::Tuple{UInt32})::Void
    $(Expr(:return))
    $(Expr(:inbounds, :pop))
end

julia> Sugar.getsource!(m) |> println # opencl code
void index_kernel_29(KernelState state, DeviceArray_double_1___global1double121 dest, DeviceArray_double_3___global1double121 src, uint3 idims, Tuple_Slice_OneTo_long_long_long Is)
{
    ;
    long I_3;
    long I_2;
    long I_1;
    uint3 is;
    uint i;
    i = linear_index_5(state);
    if(i > length_8(dest)){
        return;
    };
    is = gpu_ind2sub_15(idims, i);
    I_1 = getindex_16(Is.field1, (long)(is.s0));
    I_2 = getindex_17(Is.field2, (long)(is.s1));
    I_3 = getindex_17(Is.field3, (long)(is.s2));
    double _ssavalue_0;
    _ssavalue_0 = getindex_25(src, (long3){I_1, I_2, I_3});
    setindex9_28(dest, _ssavalue_0, (uint){i});
    return;
    ;
}
```
How about with CUDAnative or CuArrays?
For CuArrays, you need to find out the signature of the called kernel, maybe with ASTInterpreter2 or print statements in the indexing code. If it uses the GPUArrays indexing code, you can also insert a statement here: https://github.com/JuliaGPU/CuArrays.jl/blob/master/src/gpuarray_interface.jl#L61

```julia
function GPUArrays._gpu_call(f, A::CuArray, args::Tuple, blocks_threads::Tuple{T, T}) where T <: NTuple{N, Integer} where N
    blocks, threads = blocks_threads
    CUDAnative.@code_typed @cuda (blocks, threads) f(CuKernelState(), args...)
    CUDAnative.@code_llvm @cuda (blocks, threads) f(CuKernelState(), args...)
    CUDAnative.@code_ptx @cuda (blocks, threads) f(CuKernelState(), args...)
    @cuda (blocks, threads) f(CuKernelState(), args...)
end
```
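A usage sketch, assuming the override above has been evaluated and indexing actually dispatches through the GPUArrays code path:

```julia
using CuArrays, CUDAnative

da = cu(rand(4,4,4))
da[:, 2, 2]  # hits the redefined _gpu_call, dumping typed IR, LLVM IR, and PTX
```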
With Cassette, I think we could create an interface for this. For now, with debug output enabled you get to see the following:

```julia
julia> using CuArrays, CUDAnative

julia> a = rand(4,4,4)

julia> da = cu(a)

julia> da[:, 2,2]
DEBUG: (Re)compiling index_kernel(CUDAnative.CuDeviceArray{Float32,1,CUDAnative.AS.Global}, CUDAnative.CuDeviceArray{Float32,3,CUDAnative.AS.Global}, Tuple{Int64,Int64,Int64}, Tuple{Base.Slice{Base.OneTo{Int64}},Int64,Int64}) for capability 3.5.0
```

With this info, you can:

```julia
julia> CUDAnative.code_ptx(CuArrays.index_kernel, Tuple{CUDAnative.CuDeviceArray{Float32,1,CUDAnative.AS.Global}, CUDAnative.CuDeviceArray{Float32,3,CUDAnative.AS.Global}, Tuple{Int64,Int64,Int64}, Tuple{Base.Slice{Base.OneTo{Int64}},Int64,Int64}}; cap=v"3.5", kernel=true)
```

So yeah, you'll probably need a better interface if you want to use CUDAnative as a PTX compiler.
I got some code out with @SimonDanisch's hack; I'll see if I can use them. Different calls to the same function generate the same PTX code except for the symbol names.
Before we delve into making CUDAnative behave like a static compiler, why can't you use the stack "as intended", i.e. to execute GPU code (where the JIT compilation to PTX is an implementation detail)? Installation issues for CUDAnative should be resolved for 1.0 (they are already on 0.7).
Just as a stopgap measure. I don't know how far away 0.7 or 1.0 is. I just introduced cuDNN RNNs, and the users want various 3-D indexing and cat operations to support them. I have the option of writing kernels myself as before (there are just too many combinations), stealing from CUDAnative, or just telling the users to wait. Of course, once CUDAnative works out of the box none of the static code will be necessary; at that point I will just use CUDAnative with KnetArrays, or use one of CuArrays, CLArrays if their performance and coverage are sufficient.
Currently I only need this indexing:

```julia
function getindex(a::KnetArray{T,3}, i::Colon, j, k::Colon) where T
    s = size(a)
    a = permutedims(a, [1,3,2])      # (I, K, J)
    a = reshape(a, (:, s[2]))        # (I*K, J)
    a = a[:, j]                      # select along the J dimension
    a = reshape(a, (s[1], s[3], :))  # (I, K, length(j)); `:` in case j is a subset
    permutedims(a, [1,3,2])          # back to (I, length(j), K)
end
```

Could you help me with its …?
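If the missing piece is the matching `setindex!`, a sketch along the same permutedims lines might look like the following. This is an untested assumption mirroring the getindex above, and it assumes 2-D column assignment and `copy!` work on KnetArrays:

```julia
# Untested sketch: write v (of size I × length(j) × K) into a[:, j, :]
# by moving the indexed dimension last and doing a 2-D column assignment.
function setindex!(a::KnetArray{T,3}, v, ::Colon, j, ::Colon) where T
    I, J, K = size(a)
    b = permutedims(a, [1,3,2])                              # (I, K, J)
    b = reshape(b, (I*K, J))
    w = permutedims(reshape(v, (I, length(j), K)), [1,3,2])  # (I, K, length(j))
    b[:, j] = reshape(w, (I*K, length(j)))
    copy!(a, permutedims(reshape(b, (I, K, J)), [1,3,2]))    # back to (I, J, K)
    return a
end
```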
No worries. I just made it a function. It's interesting that using 2-D indexing with sub2ind is 20 times faster than using permutedims in my case:

```julia
function getat(x::KnetArray{T,3}, i::Colon, j, k) where T
    I,J,K = size(x)
    k == (:) && (k = 1:K)
    x = reshape(x, (I, J*K))
    js = repeat(j, outer=K)      # column indices within each K-slice
    ks = repeat(k, inner=J)      # K-slice index for each column
    ix = sub2ind((J,K), js, ks)  # linear column indices into the (J,K) grid
    x = x[:, ix]
    reshape(x, (I,J,K))
end
```
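For what it's worth, a micro-benchmark sketch of that comparison (BenchmarkTools and the sizes here are my assumptions, not the original measurement):

```julia
using BenchmarkTools, Knet

x = KnetArray(rand(Float32, 512, 32, 64))
j = shuffle(1:32)              # full-length permutation, as in hidden-state reordering
@btime getindex($x, :, $j, :)  # permutedims-based version
@btime getat($x, :, $j, :)     # sub2ind-based version
```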
Hi @ngphuoc, you could also take a look at this PR #229, which amounts to:

```julia
function getindex(a::KnetArray, I...)
    crange = CartesianRange(to_indices(a, I))
    linind = [sub2ind(size(a), t.I...) for t in crange]
    b = getindex(a, vec(linind))
    shape = size(crange) # TODO drop scalar dimension
    reshape(b, shape)
end

function setindex!(a::KnetArray, v, I...)
    crange = CartesianRange(to_indices(a, I))
    linind = [sub2ind(size(a), t.I...) for t in crange]
    setindex!(a, v, vec(linind))
end
```
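A quick usage sketch of that fallback (assuming the PR's methods are in scope):

```julia
a = KnetArray(rand(Float32, 4, 4, 4))
b = a[:, 2:3, :]   # unit range goes through the CartesianRange fallback above
a[:, 2:3, :] = b   # setindex! takes the same linear-index path
```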
That's excellent! Thanks @CarloLucibello
It gives the following error on Julia 0.6.2:

```julia
julia> a[:,i,:];
ERROR: MethodError: Cannot `convert` an object of type Tuple{Base.Slice{Base.OneTo{Int64}},Array{Int64,1},Base.Slice{Base.OneTo{Int64}}} to an object of type CartesianRange
This may have arisen from a call to the constructor CartesianRange(...),
since type constructors fall back to convert methods.
Stacktrace:
 [1] getindex(::Knet.KnetArray{Float32,3}, ::Colon, ::Array{Int64,1}, ::Colon) at ./REPL[11]:3
```
Supported indices are colons, scalars, and unit ranges for the time being.
I found …
CuArrays fallbacks fix this in 85c4125.