Feature request: more KnetArray indexing #198
And these are needed to reorder hidden states between runs in RNNs with dynamic sizes:

```julia
getindex(::Knet.KnetArray{Float32,3}, ::Colon, ::Array{Int64,1}, ::Colon)
getindex(::Knet.KnetArray{Float32,3}, ::Colon, ::UnitRange{Int64}, ::Colon)
getindex(::Knet.KnetArray{Float32,3}, ::Colon, ::StepRange{Int64,Int64}, ::Colon)
```
@maleadt I wonder if we can somehow use the kernels generated by CUDAnative instead of trying to implement all these indexing variations by hand. I would generate code for these using CUDAnative and put it in libknet8.so for now, I am just not sure how. This would be a stopgap measure until CUDAnative no longer requires a Julia recompile and I can use CuArrays etc.
Yeah, definitely. I'm not sure what the kernels would look like, but you could use metaprogramming to generate the function definitions. If you can come up with a simple example I could look into creating a proof of concept.
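For illustration, here is a minimal sketch of what that metaprogramming could look like. Everything here is hypothetical: `unsafe_getindex3d` stands in for a kernel that would be generated ahead of time with CUDAnative and shipped in libknet8; it does not exist in Knet.

```julia
# Hypothetical sketch: generate one getindex method per index type instead
# of hand-writing every variant. unsafe_getindex3d is an assumed helper
# wrapping a pre-compiled kernel, not an existing Knet function.
for J in (Array{Int64,1}, UnitRange{Int64}, StepRange{Int64,Int64})
    @eval Base.getindex(a::KnetArray{Float32,3}, ::Colon, j::$J, ::Colon) =
        unsafe_getindex3d(a, collect(Int64, j))
end
```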
OK, first I wanted to make sure that we can cover the indexing operations with CUDAnative using CuArrays. I got a segfault in Julia release-0.6. Tagging @maleadt, @MikeInnes, @SimonDanisch.
I got a CUDA error instead of a segfault: https://gist.github.com/ilkerkesen/c921d41894b8667d8acc6e71803f04fe. I am using Julia v0.6, and the full path is:
CLArrays seems to work fine, so I guess this is CUDAnative-specific (CuArrays and CLArrays should share the same indexing code, if I'm not missing something).
Ah no, it's just almost the same code... Let me check the specific differences.
I'm running out of time and couldn't get to the bottom of it. Just comment out the indexing include in CuArrays.jl at https://github.com/JuliaGPU/CuArrays.jl/blob/master/src/CuArrays.jl#L13 and try it again. Then it should use the same code as CLArrays for indexing!
Tagged versions / using Pkg should work; respecting REQUIRE is ... required.
Hm, maybe I messed up somewhere, but all I did was Pkg.update, and it told me I need 0.7 for LLVM.jl. Then I started manually downgrading relevant packages, and since each package always took a long time, I ended up in the situation mentioned above ;)
Huh, 0.5.1 is the last version of LLVM.jl that supports 0.6; 0.9.0 does indeed require 0.7, but that shouldn't get auto-installed, right?

EDIT: this is getting a bit off topic, we shouldn't be hijacking this issue I guess. @SimonDanisch, could you open an issue on LLVM.jl if Pkg effectively updated to an incompatible version?
Weird! Yeah, not sure what's going on ;)
@SimonDanisch how can I repeat your test with CLArrays? I guess I am confused about the relationship between GPUArrays, CLArrays, CuArrays, etc.
I can confirm that, as @SimonDanisch suggested, commenting out https://github.com/JuliaGPU/CuArrays.jl/blob/master/src/CuArrays.jl#L13 (i.e. the indexing include) works.
Well, both inherit from GPUArrays, which also contains hardware-independent kernels (e.g. for broadcasting and indexing), and in theory CuArrays and CLArrays should share all those kernels and interfaces. Long story short: @MikeInnes wanted to take things slowly, so almost all code in CuArrays was kept for now, overwriting the hardware-independent kernels from GPUArrays. The replacement with the GPUArrays kernels will come piece by piece, but for every kernel in CuArrays there should already be a hardware-independent version in GPUArrays (which should offer the same functionality at the same execution speed). So CLArrays itself doesn't contain a single kernel and uses the GPUArrays hardware-independent kernels exclusively, which are pretty much a mixture of the best kernels from CuArrays (only rewritten to be hardware independent) and the old GPUArrays.

Regarding using CLArrays:

```julia
using CLArrays

a = rand(4,4,4)
@show summary(a)
@show summary(a[:,2,2])
@show summary(a[:,[2,4],2])
@show summary(a[:,[2,4],[1,3]])
@show summary(a[:,[2,4],:])
@show summary(a[:,2:4,:])
@show summary(a[:,1:2:4,:])

a = CLArray(a)
@show summary(a)
@show summary(a[:,2,2])
@show summary(a[:,[2,4],2])
@show summary(a[:,[2,4],[1,3]])
@show summary(a[:,[2,4],:])
@show summary(a[:,2:4,:])
@show summary(a[:,1:2:4,:])
```
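Presumably the same check can be run against CuArrays by only swapping the constructor (a sketch, assuming `CuArray(a)` uploads a host array the same way `CLArray(a)` does):

```julia
using CuArrays

a = CuArray(rand(4,4,4))  # assumption: same upload pattern as CLArray(a)
@show summary(a[:,[2,4],:])
@show summary(a[:,2:4,:])
@show summary(a[:,1:2:4,:])
```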
@SimonDanisch, @maleadt how can I see the PTX code behind these indexing operations with CLArrays or CuArrays? Is there a separate kernel compiled for each distinct getindex argument signature? How does the same PTX handle different index ranges or different index arrays?
You can use CUDAnative's reflection functions (e.g. code_ptx). EDIT: of course, you need to know the exact kernel signature for that.
I don't really have a nice API for this, but you can quite easily access any compiled kernel like this:

```julia
julia> a[:,2,2] # compile an indexing operation
GPU: 4-element Array{Float64,1}:
 0.337476
 0.673492
 0.28916
 0.260523

julia> CLArrays.compiled_functions # the compiled kernel cache
Dict{Any,Any} with 1 entry:
  (Ptr{Void} @0x00000000036b3760, GPUArrays.index_kernel, (CLArrays… => CLArrays.CLFunction{GPUArrays.#index_kernel,Tuple{CLArrays.KernelState,CLArrays.CLArray{Float64,1},CLArrays.CLArray{Float64,3},Tupl…

julia> signature = first(CLArrays.compiled_functions)[1][2:end] # the signature of a compiled kernel
(GPUArrays.index_kernel, (CLArrays.KernelState, CLArrays.DeviceArray{Float64,1,Transpiler.CLIntrinsics.GlobalPointer{Float64}}, CLArrays.DeviceArray{Float64,3,Transpiler.CLIntrinsics.GlobalPointer{Float64}}, Tuple{UInt32,UInt32,UInt32}, Tuple{Base.Slice{Base.OneTo{Int64}},Int64,Int64}))

julia> m = Transpiler.CLMethod(signature)
GPUArrays.index_kernel(CLArrays.KernelState, CLArrays.DeviceArray{Float64,1,Transpiler.CLIntrinsics.GlobalPointer{Float64}}, CLArrays.DeviceArray{Float64,3,Transpiler.CLIntrinsics.GlobalPointer{Float64}}, Tuple{UInt32,UInt32,UInt32}, Tuple{Base.Slice{Base.OneTo{Int64}},Int64,Int64})

julia> Sugar.getast!(m) # compiled julia expr
quote
    $(Expr(:inbounds, true))
    I_3::Int64::Int64
    I_2::Int64::Int64
    I_1::Int64::Int64
    is::Tuple{UInt32,UInt32,UInt32}::Tuple{UInt32,UInt32,UInt32}
    i::UInt32::UInt32
    _7::UInt32 = (GPUArrays.linear_index)(_2::CLArrays.KernelState)::UInt32
    if (>)(_7::UInt32, (length)(_3::CLArrays.DeviceArray{Float64,1,Transpiler.CLIntrinsics.GlobalPointer{Float64}})::UInt32)::Bool
        $(Expr(:return))
    end
    _8::Tuple{UInt32,UInt32,UInt32} = (GPUArrays.gpu_ind2sub)(_5::Tuple{UInt32,UInt32,UInt32}, _7::UInt32)::Tuple{UInt32,UInt32,UInt32}
    _9::Int64 = (getindex)((getfield)(_6::Tuple{Base.Slice{Base.OneTo{Int64}},Int64,Int64}, field1)::Base.Slice{Base.OneTo{Int64}}, (Int64)((getfield)(_8::Tuple{UInt32,UInt32,UInt32}, s0)::UInt32)::Int64)::Int64
    _10::Int64 = (getindex)((getfield)(_6::Tuple{Base.Slice{Base.OneTo{Int64}},Int64,Int64}, field2)::Int64, (Int64)((getfield)(_8::Tuple{UInt32,UInt32,UInt32}, s1)::UInt32)::Int64)::Int64
    _11::Int64 = (getindex)((getfield)(_6::Tuple{Base.Slice{Base.OneTo{Int64}},Int64,Int64}, field3)::Int64, (Int64)((getfield)(_8::Tuple{UInt32,UInt32,UInt32}, s2)::UInt32)::Int64)::Int64
    _12::Float64::Float64::Float64
    _12::Float64 = (getindex)(_4::CLArrays.DeviceArray{Float64,3,Transpiler.CLIntrinsics.GlobalPointer{Float64}}, (Tuple{Int64,Int64,Int64}){_9::Int64, _10::Int64, _11::Int64}::Tuple{Int64,Int64,Int64})::Float64
    (setindex!)(_3::CLArrays.DeviceArray{Float64,1,Transpiler.CLIntrinsics.GlobalPointer{Float64}}, _12::Float64, (Tuple{UInt32}){_7::UInt32}::Tuple{UInt32})::Void
    $(Expr(:return))
    $(Expr(:inbounds, :pop))
end

julia> Sugar.getsource!(m) |> println # opencl code
void index_kernel_29(KernelState state, DeviceArray_double_1___global1double121 dest, DeviceArray_double_3___global1double121 src, uint3 idims, Tuple_Slice_OneTo_long_long_long Is)
{
    ;
    long I_3;
    long I_2;
    long I_1;
    uint3 is;
    uint i;
    i = linear_index_5(state);
    if(i > length_8(dest)){
        return;
    };
    is = gpu_ind2sub_15(idims, i);
    I_1 = getindex_16(Is.field1, (long)(is.s0));
    I_2 = getindex_17(Is.field2, (long)(is.s1));
    I_3 = getindex_17(Is.field3, (long)(is.s2));
    double _ssavalue_0;
    _ssavalue_0 = getindex_25(src, (long3){I_1, I_2, I_3});
    setindex9_28(dest, _ssavalue_0, (uint){i});
    return;
    ;
}
```
How about with CUDAnative or CuArrays?
For CuArrays, you need to find out the signature of the called kernel, maybe with ASTInterpreter2 or print statements in the indexing code. If it uses the GPUArrays indexing code, you can also insert a statement here: https://github.com/JuliaGPU/CuArrays.jl/blob/master/src/gpuarray_interface.jl#L61

```julia
function GPUArrays._gpu_call(f, A::CuArray, args::Tuple, blocks_threads::Tuple{T, T}) where T <: NTuple{N, Integer} where N
    blocks, threads = blocks_threads
    CUDAnative.@code_typed @cuda (blocks, threads) f(CuKernelState(), args...)
    CUDAnative.@code_llvm @cuda (blocks, threads) f(CuKernelState(), args...)
    CUDAnative.@code_ptx @cuda (blocks, threads) f(CuKernelState(), args...)
    @cuda (blocks, threads) f(CuKernelState(), args...)
end
```
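A usage sketch, assuming the override above has been evaluated and indexing actually dispatches through the GPUArrays code path:

```julia
using CuArrays, CUDAnative

da = cu(rand(4,4,4))
da[:, 2, 2]  # hits the redefined _gpu_call, dumping typed IR, LLVM IR, and PTX
```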
With Cassette, I think we could create an interface for this. For now, with debug output enabled you get to see the following:

```julia
julia> using CuArrays, CUDAnative

julia> a = rand(4,4,4)

julia> da = cu(a)

julia> da[:, 2,2]
DEBUG: (Re)compiling index_kernel(CUDAnative.CuDeviceArray{Float32,1,CUDAnative.AS.Global}, CUDAnative.CuDeviceArray{Float32,3,CUDAnative.AS.Global}, Tuple{Int64,Int64,Int64}, Tuple{Base.Slice{Base.OneTo{Int64}},Int64,Int64}) for capability 3.5.0
```

With this info, you can:

```julia
julia> CUDAnative.code_ptx(CuArrays.index_kernel, Tuple{CUDAnative.CuDeviceArray{Float32,1,CUDAnative.AS.Global}, CUDAnative.CuDeviceArray{Float32,3,CUDAnative.AS.Global}, Tuple{Int64,Int64,Int64}, Tuple{Base.Slice{Base.OneTo{Int64}},Int64,Int64}}; cap=v"3.5", kernel=true)
```

So yeah, you'll probably need a better interface if you want to use CUDAnative as a PTX compiler.
I got some code out with @SimonDanisch's hack; I'll see if I can use them. Different calls to the same function generate the same PTX code except for the symbol names.
Before we delve into making CUDAnative behave like a static compiler, why can't you use the stack "as intended", i.e. to execute GPU code (where the JIT compilation to PTX is an implementation detail)? Installation issues for CUDAnative should be resolved for 1.0 (they are already on 0.7).
Just as a stopgap measure. I don't know how far away 0.7 or 1.0 is. I just introduced cuDNN RNNs, and the users want various 3-D indexing and cat operations to support them. I have the option of writing kernels myself as before (there are just too many combinations), stealing from CUDAnative, or just telling the users to wait. Of course, once CUDAnative works out of the box none of the static code will be necessary; at that point I will just use CUDAnative with KnetArrays, or use one of CuArrays, CLArrays if their performance and coverage are sufficient.
Currently I only need this indexing:

```julia
function getindex(a::KnetArray{T,3}, i::Colon, j, k::Colon) where T
    s = size(a)
    a = permutedims(a, [1,3,2])      # (I, K, J)
    a = reshape(a, (:, s[2]))        # (I*K, J)
    a = a[:, j]                      # select along the J dimension
    a = reshape(a, (s[1], s[3], :))  # (I, K, length(j)); `:` in case j is a subset
    permutedims(a, [1,3,2])          # back to (I, length(j), K)
end
```

Could you help me with its …?
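If the missing piece is the matching `setindex!`, a sketch along the same permutedims lines might look like the following. This is an untested assumption mirroring the getindex above, and it assumes 2-D column assignment and `copy!` work on KnetArrays:

```julia
# Untested sketch: write v (of size I × length(j) × K) into a[:, j, :]
# by moving the indexed dimension last and doing a 2-D column assignment.
function setindex!(a::KnetArray{T,3}, v, ::Colon, j, ::Colon) where T
    I, J, K = size(a)
    b = permutedims(a, [1,3,2])                              # (I, K, J)
    b = reshape(b, (I*K, J))
    w = permutedims(reshape(v, (I, length(j), K)), [1,3,2])  # (I, K, length(j))
    b[:, j] = reshape(w, (I*K, length(j)))
    copy!(a, permutedims(reshape(b, (I, K, J)), [1,3,2]))    # back to (I, J, K)
    return a
end
```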
No worries. I just made it a function. It's interesting that using 2-D indexing with sub2ind is 20 times faster than using permutedims in my case:

```julia
function getat(x::KnetArray{T,3}, i::Colon, j, k) where T
    I,J,K = size(x)
    k == (:) && (k = 1:K)
    x = reshape(x, (I, J*K))
    js = repeat(j, outer=K)      # column indices within each K-slice
    ks = repeat(k, inner=J)      # K-slice index for each column
    ix = sub2ind((J,K), js, ks)  # linear column indices into the (J,K) grid
    x = x[:, ix]
    reshape(x, (I,J,K))
end
```
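For what it's worth, a micro-benchmark sketch of that comparison (BenchmarkTools and the sizes here are my assumptions, not the original measurement):

```julia
using BenchmarkTools, Knet

x = KnetArray(rand(Float32, 512, 32, 64))
j = shuffle(1:32)              # full-length permutation, as in hidden-state reordering
@btime getindex($x, :, $j, :)  # permutedims-based version
@btime getat($x, :, $j, :)     # sub2ind-based version
```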
Hi @ngphuoc, you could also take a look at this PR #229, which amounts to:

```julia
function getindex(a::KnetArray, I...)
    crange = CartesianRange(to_indices(a, I))
    linind = [sub2ind(size(a), t.I...) for t in crange]
    b = getindex(a, vec(linind))
    shape = size(crange) # TODO drop scalar dimension
    reshape(b, shape)
end

function setindex!(a::KnetArray, v, I...)
    crange = CartesianRange(to_indices(a, I))
    linind = [sub2ind(size(a), t.I...) for t in crange]
    setindex!(a, v, vec(linind))
end
```
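A quick usage sketch of that fallback (assuming the PR's methods are in scope):

```julia
a = KnetArray(rand(Float32, 4, 4, 4))
b = a[:, 2:3, :]   # unit range goes through the CartesianRange fallback above
a[:, 2:3, :] = b   # setindex! takes the same linear-index path
```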
That's excellent! Thanks @CarloLucibello
It gives the following error on Julia 0.6.2:

```julia
julia> a[:,i,:];
ERROR: MethodError: Cannot `convert` an object of type Tuple{Base.Slice{Base.OneTo{Int64}},Array{Int64,1},Base.Slice{Base.OneTo{Int64}}} to an object of type CartesianRange
This may have arisen from a call to the constructor CartesianRange(...),
since type constructors fall back to convert methods.
Stacktrace:
 [1] getindex(::Knet.KnetArray{Float32,3}, ::Colon, ::Array{Int64,1}, ::Colon) at ./REPL[11]:3
```
Supported indices are colons, scalars, and unit ranges for the time being.
I found …
CuArrays fallbacks fix this in 85c4125.