
Collect GeneralizedKroneckerProducts in-place #61

Open
elisno opened this issue May 6, 2020 · 10 comments

@elisno
Contributor

elisno commented May 6, 2020

Using the vec-trick with higher order Kronecker products, such as

((τ1 ⊕ τ2) ⊗ (τ3 ⊕ τ4)) * vec(D)

should result in the lower-order products ( (τ1 ⊕ τ2) and (τ3 ⊕ τ4) ) being collected.

In my use case, the size of each (sparse) matrix is around N × N = 128 × 128. These matrices are stored in τ::Array{T<:Complex,4} with size(τ) = (N, N, M, L). The sparsity pattern along M and L is assumed to be the same.
This would be more efficient if (sparse) workspace matrices were used to store the intermediate results, reducing the number of allocations.
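For context, the vec trick referenced here is the identity (A ⊗ B) vec(D) = vec(B D Aᵀ); a minimal sketch with Base's kron (matrix sizes are illustrative, not the 128 × 128 case):

```julia
using LinearAlgebra

# Vec trick: (A ⊗ B) * vec(D) == vec(B * D * transpose(A)),
# with Julia's column-major vec.
A = rand(4, 4)
B = rand(3, 3)
D = rand(3, 4)                    # size(B, 1) × size(A, 2)

lhs = kron(A, B) * vec(D)         # materializes the full 12 × 12 product
rhs = vec(B * D * transpose(A))   # never forms the Kronecker product

@assert lhs ≈ rhs
```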


Last year, some work was started on kron! (JuliaLang/julia#31069) that computed the Kronecker product in-place.

  • Should this implementation be used? If so, should we wait for the PR to go through, or write these methods in Kronecker.jl?

  • Would it be enough to apply this to KroneckerProducts or do we need methods that specialize on other types, such as KroneckerSums and IndexedKroneckerProducts?

  • Currently sparse collects KroneckerSums. Shouldn't it be lazy by default to accommodate these in-place methods? I'm not sure how to generalize over all types, but we could do something like:

sparse(K::KroneckerSum) = sparse(K.A) ⊕ sparse(K.B)
@MichielStock
Owner

Using the vec-trick with higher order Kronecker products, such as

((τ1 ⊕ τ2) ⊗ (τ3 ⊕ τ4)) * vec(D)

should result in the lower-order products ( (τ1 ⊕ τ2) and (τ3 ⊕ τ4) ) being collected.

In my use case, the size of each (sparse) matrix is around N × N = 128 × 128. These matrices are stored in τ::Array{T<:Complex,4} with size(τ) = (N, N, M, L). The sparsity pattern along M and L is assumed to be the same.
This would be more efficient if (sparse) workspace matrices were used to store the intermediate results, reducing the number of allocations.

This is a good idea. It would be simple to either

  • update the vec trick, so that vector products with Kronecker products of two Kronecker sums (or one) immediately collect them into sparse matrices.
  • update Kronecker directly, so that when one of the matrices is a Kronecker sum, it collects the sparse forms.

I am leaning towards the second option as the better one. What do you think?

@MichielStock MichielStock self-assigned this May 8, 2020
@MichielStock
Owner

Last year, some work was started on kron! (JuliaLang/julia#31069) that computed the Kronecker product in-place.

  • Should this implementation be used? If so, should we wait for the PR to go through, or write these methods in Kronecker.jl?
  • Would it be enough to apply this to KroneckerProducts or do we need methods that specialize on other types, such as KroneckerSums and IndexedKroneckerProducts?

I like this idea. I guess we could add a function collectin!(C, K) or collect!(C, K), whichever is more idiomatic. I'm not sure what the best way to use kron! is, as it is not yet shipped with Base. We might copy the code for now and follow up when kron! becomes available in a stable version?

This trick will likely only work with systems containing a single Kronecker product, unless the code becomes more involved.
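Until kron! is available from Base (or vendored from the PR), the in-place semantics a collect!(C, K) would need can be sketched for dense matrices. kron_into! is a hypothetical name for illustration, not the PR's API:

```julia
using LinearAlgebra

# Sketch of in-place Kronecker semantics: fill a preallocated C with kron(A, B).
# `kron_into!` is a hypothetical name; the Base PR spells this `kron!(C, A, B)`.
function kron_into!(C::AbstractMatrix, A::AbstractMatrix, B::AbstractMatrix)
    m, n = size(A)
    p, q = size(B)
    @assert size(C) == (m * p, n * q)
    for j in 1:n, i in 1:m
        # write the (i, j) block of C, which equals A[i, j] .* B
        C[(i-1)*p+1:i*p, (j-1)*q+1:j*q] .= A[i, j] .* B
    end
    return C
end

A = rand(4, 4); B = rand(3, 3)
C = Matrix{Float64}(undef, 12, 12)
kron_into!(C, A, B)
@assert C ≈ kron(A, B)
```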

  • Currently sparse collects KroneckerSums. Shouldn't it be lazy by default to accommodate these in-place methods? I'm not sure how to generalize over all types, but we could do something like:
sparse(K::KroneckerSum) = sparse(K.A) ⊕ sparse(K.B)

Since A and B are often dense, I don't directly see how this would make sense. As far as I am aware, many of the shortcuts that work with Kronecker products don't work with Kronecker sums. So, in many cases, you would want to work with the sparse version of the Kronecker sum, except for matrix exponentiation. Can you give some examples of how this would be beneficial?

@elisno
Contributor Author

elisno commented May 8, 2020

This is a good idea. It would be simple to either

  • update the vec trick, so that vector products with Kronecker products of two Kronecker sums (or one) immediately collect them into sparse matrices.
  • update Kronecker directly, so that when one of the matrices is a Kronecker sum, it collects the sparse forms.

I am leaning towards the second option as the better one. What do you think?

Previously, I resorted to the first option by collecting each sum (and manually writing out the vec-trick):

collect(τ3 ⊕ τ4) * lmul!(D, collect(transpose(τ1 ⊕ τ2)))

In my use case, I use lmul! because D is Diagonal.

Each τ is essentially 4 versions of the sparse matrix t
(found in https://gist.github.com/elisno/b49ad947046180917bc8da400b57ca7e).

The τs are operated on by transpose(), adjoint(), conj() (and identity(), technically). τ1 and τ2 are multiplied by a scalar.

Explicitly collecting the sums (out-of-place) becomes expensive because I have hundreds / thousands of t matrices that appear to have the same sparsity structure.
I'd rather collect the non-zero values of t and update τx.nzval .= f(t.nzval) in some way.
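This nzval-sharing pattern can be sketched with SparseArrays alone; f here is a hypothetical stand-in for the transpose/conj/scaling operations mentioned above:

```julia
using SparseArrays

# Sketch: matrices sharing one sparsity pattern, with values updated in place.
t  = sprand(8, 8, 0.2)
τx = copy(t)              # same structural nonzeros as t
f(v) = 2 .* v             # hypothetical stand-in for conj/scaling/etc.

# Update only the stored values; the structure (colptr, rowval) is untouched.
τx.nzval .= f(t.nzval)

@assert τx == 2 * t
```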

My current version is hard to read and is not very robust.

Do I understand correctly that the second option would allow updating the non-zero values of τ1, τ2, τ3 and τ4 in-place? I.e. after updating, each term of Kronecker would still be a KroneckerSum? If so, this is a clever and clean solution. I'd most likely start using the "higher order vec-trick" with vec(D).


I guess we could add a function collectin!(C, K) or collect!(C, K), whatever is more idiomatic. Not sure what is the best way to use kron! as it is not shipped yet with the Base.

Perhaps a field for the matrix C should be added to KroneckerSum (similarly to the identity matrices discussed in #62)?


Since A and B are often dense, I don't directly see how this would make sense. As far as I am aware, many of the shortcuts that work with Kronecker products don't work with Kronecker sums. So, in many cases, you would want to work with the sparse version of the Kronecker sum, except for matrix exponentiation. Can you give some examples of how this would be beneficial?

I'd say this is only important for collect/collect! when a Kronecker or KroneckerSum is composed of small dense matrices, based on the benchmarks in #64 (comment).

Admittedly, I don't have any specific usage for this, as I need to collect sums of larger, sparse matrices.

I'd rather consider this a simple (optional) utility function we can use to store large (dense or sparse) matrices as sparse matrices in higher-order products/sums, and allow the user to update them in-place in a manner of their choosing.

Here's a dummy example calculating a linear combination of Kronecker products, Kronecker sums, or the higher-order example at the top of this issue.

# Small dense matrices
using Kronecker, LinearAlgebra

A = rand(4,4);
B = rand(3,3);
Anew = rand(4,4);
Bnew = rand(3,3);

result = kron(A,B) + kron(Anew, Bnew);
C = zeros(12,12);

K = A ⊗ B; # or ⊕
C .+= collect(K) # or an equivalent in-place method

K.A .= Anew;
K.B .= Bnew;

C .+= collect(K) # or an equivalent in-place method

C ≈ result

If the sizes of A, B, etc. increase, I'd think one would only have to change how K is constructed and initialized:

K = sparse(A ⊗ B) # or ⊕; should not automatically collect to a sparse matrix
...
K.A.nzval .= vec(Anew)
K.B.nzval .= vec(Bnew)

@Roger-luo

Roger-luo commented May 9, 2020

Hi, I came here from @elisno's message on Discourse and his comment under my PR JuliaLang/julia#31069; I'll reply to both here so it's easier to keep track.

Generally speaking, the implementations of kron! and batched_kron! in our package Yao were mainly written for tests: https://github.com/QuantumBFS/YaoBase.jl/blob/master/src/utils/math.jl#L139

They were not heavily optimized for performance, since for quantum circuits there is much more room to optimize performance for quantum gates specifically. The sparse matrix types defined in LuxurySparse are likewise not specifically optimized, but in principle specializing kron! and batched_kron! on these matrix types should give a speedup. I'm glad to guide you if you want to work on making our implementation more general.

Regarding compatibility, you can depend on YaoBase for the kron! interface across all Julia 1.x versions, since we will need to maintain that anyway.

@elisno
Contributor Author

elisno commented May 9, 2020

I've looked over the kron/kron! and batched_kron/batched_kron! implementations in the Yao package suite. The performance of the dense products is quite impressive (particularly with batched_kron on the GPU).

I think it would be a good idea to use YaoBase for kron! until it's a part of Base.

In CuYao.jl, CuArray works with kron but I think it's not too difficult to add kron! as well. Kronecker.jl would finally have GPU compatibility if it depends on CuYao.
https://github.com/QuantumBFS/CuYao.jl/blob/master/src/CUDApatch.jl#L75-L91


I'm not sure how it would work with higher-order products / sums or what lazy batches would look like, but batched_kron! could be used when collecting a linear combination of KroneckerProducts or KroneckerSums.
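For reference, the batched semantics under discussion amount to one kron per batch slice; a naive CPU sketch (not Yao's optimized batched_kron!):

```julia
using LinearAlgebra

# Naive reference semantics for a batched Kronecker product:
# one kron per slice along the third (batch) dimension.
# This is a sketch, not YaoBase's optimized implementation.
function batched_kron_ref(A::AbstractArray{T,3}, B::AbstractArray{T,3}) where {T}
    @assert size(A, 3) == size(B, 3)
    m, n = size(A, 1), size(A, 2)
    p, q = size(B, 1), size(B, 2)
    C = Array{T,3}(undef, m * p, n * q, size(A, 3))
    for k in axes(A, 3)
        C[:, :, k] = kron(A[:, :, k], B[:, :, k])
    end
    return C
end

A = rand(2, 2, 5); B = rand(3, 3, 5)
C = batched_kron_ref(A, B)
@assert C[:, :, 3] ≈ kron(A[:, :, 3], B[:, :, 3])
```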

@MichielStock, what are your thoughts on using ⊗ and ⊕ for Array{T,3} (and hopefully CuArray{T,3})?


I'm still trying to figure out how to collect (sparse) KroneckerSums on GPUs.

  • The kernel for dense kron is straightforward: each thread gets a specific state of the resulting product.

  • The kernel for batched_kron launches a thread for each batch.

  • I'm not sure how the kernel for kron would look for CuSparseMatrixCSC. Would the kernel replace the two outermost loops in the CPU implementation of sparse kron!?

@Roger-luo

I think we could consider moving all kron! and batched_kron! to LuxurySparse.jl, which probably makes more sense; you could also support more matrix types from there. What do you guys think? @MichielStock @elisno

@elisno
Contributor Author

elisno commented May 10, 2020

For this issue, it would make sense to use LuxurySparse.jl for sparse kron! on the CPU only. I'll have to think more about how batched_kron! could be used with the current Kronecker types.

I'll leave the GPU implementation for another issue later.

@Roger-luo

Yeah, I think the best way to support GPUs is to have a new package specifically for them: as I discussed with Stefan a year ago, a separate package is currently the best way to handle conditional dependencies, though there are plans to support conditional dependencies in Pkg.

So a new special array package made specifically for GPUs would make more sense.

@MichielStock
Owner

This is a good idea. It would be simple to either

  • update the vec trick, so that vector products with Kronecker products of two Kronecker sums (or one) immediately collect them into sparse matrices.
  • update Kronecker directly, so that when one of the matrices is a Kronecker sum, it collects the sparse forms.

I am leaning towards the second option as the better one. What do you think?

Previously, I resorted to the first option by collecting each sum (and manually writing out the vec-trick):

collect(τ3 ⊕ τ4) * lmul!(D, collect(transpose(τ1 ⊕ τ2)))

In my use case, I use lmul! because D is Diagonal.

Each τ is essentially 4 versions of the sparse matrix t
(found in https://gist.github.com/elisno/b49ad947046180917bc8da400b57ca7e).

The τs are operated on by transpose(), adjoint(), conj() (and identity(), technically). τ1 and τ2 are multiplied by a scalar.

Explicitly collecting the sums (out-of-place) becomes expensive because I have hundreds / thousands of t matrices that appear to have the same sparsity structure.
I'd rather collect the non-zero values of t and update τx.nzval .= f(t.nzval) in some way.

Are you sure you need to use collect? If your expression contains only a single Kronecker sum/product, it is faster to use the vec trick, which it falls back to automatically.
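The fallback mentioned here rests on the Kronecker-sum vec identity; a dense sketch with Base operations (a lazy implementation can exploit the same identity without ever materializing the sum):

```julia
using LinearAlgebra

# Vec trick for a Kronecker sum A ⊕ B = A ⊗ I + I ⊗ B (column-major vec):
# (A ⊕ B) * vec(D) == vec(B * D + D * transpose(A)).
A = rand(4, 4)
B = rand(3, 3)
D = rand(3, 4)

# Dense reference for A ⊕ B, built explicitly for comparison.
ksum = kron(A, Matrix(1.0I, 3, 3)) + kron(Matrix(1.0I, 4, 4), B)
lhs  = ksum * vec(D)
rhs  = vec(B * D + D * transpose(A))   # no Kronecker structure formed

@assert lhs ≈ rhs
```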

@MichielStock
Owner

MichielStock commented May 11, 2020

Ok, I will summarize the discussion (as I see it):

  • we need a collectin!(C, K) function that collects GeneralizedKroneckerProducts in C; this should work for all kinds of Kronecker systems (kron prod, kron sum, kron exp, etc.).
  • I am inclined to add a new Kronecker type for sums of multiple GeneralizedKroneckerProducts, i.e. A ⊗ B + C ⊗ D. In such cases, one can then use pre-allocated matrices to collect them.
  • If the above works, we might need a separate package to extend this to GPUs.

Do I have the gist of it? If so, I will move these to respective issues.

Further thoughts:

  • I'm not sure the current package can support Array{T,3}. If this can be branched from the root of my type tree as separate functionality, I am game.
  • kron! is ideally imported from another package.
  • also, take care that we don't try to emulate the functionality of TensorOperations.
