Some facts about the runtime speed. #15

SUZhaoyu opened this issue Dec 25, 2018 · 1 comment

SUZhaoyu commented Dec 25, 2018

Hi, I tried to reimplement an operation similar to yours in TensorFlow and found two facts about runtime speed:

  1. The GPU kernel that accumulates the filter gradients across the batch has almost no influence on speed; in fact, the original MXNet implementation applies the same idea.

  2. Splitting the backpropagation for the different input variables into separate TF ops does accelerate the runtime, but only about a 30% boost was observed compared with wrapping everything into one TF op (a sketch of this split follows the list).
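
For context, the split in point 2 looks roughly like the following on the Python side. This is only a minimal sketch under assumptions: the op name DeformConv and the three backprop op names are hypothetical placeholders for whatever the compiled library actually registers.

import tensorflow as tf
from tensorflow.python.framework import ops

_mod = tf.load_op_library('./deform_conv.so')

@ops.RegisterGradient("DeformConv")  # hypothetical op name
def _deform_conv_grad(op, grad):
    data, offset, filters = op.inputs
    # One TF op per input variable lets the runtime schedule the three
    # backprops independently instead of serializing them in one kernel.
    d_data = _mod.deform_conv_backprop_input(data, offset, filters, grad)
    d_offset = _mod.deform_conv_backprop_offset(data, offset, filters, grad)
    d_filter = _mod.deform_conv_backprop_filter(data, offset, filters, grad)
    return [d_data, d_offset, d_filter]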

I think the bottleneck is most likely the im2col/col2im operation, which is implemented in plain CUDA with little optimization (compared with cuDNN). The authors of Deform Conv also acknowledged that the main downside of their implementation is that it does not use cuDNN for optimization (sorry, I cannot find the original source).
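
To make the im2col point concrete, here is what the operation computes, sketched in plain NumPy purely for illustration; the real implementation is a CUDA kernel, and the deformable version additionally samples bilinearly at offset positions.

import numpy as np

def im2col(x, k, stride=1):
    # x: (C, H, W) -> columns of shape (C*k*k, out_h*out_w).
    # This layout turns convolution into one big GEMM, which is
    # exactly the step cuDNN optimizes and plain CUDA code does not.
    C, H, W = x.shape
    out_h = (H - k) // stride + 1
    out_w = (W - k) // stride + 1
    cols = np.empty((C * k * k, out_h * out_w), dtype=x.dtype)
    for i in range(out_h):
        for j in range(out_w):
            patch = x[:, i*stride:i*stride+k, j*stride:j*stride+k]
            cols[:, i * out_w + j] = patch.ravel()
    return cols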

Hopefully these results are helpful for others interested in a Deform Conv implementation in TensorFlow, especially now that the Deform Conv v2 paper has recently been released.

Any comments or further discussion are welcome, and Merry Christmas!


SUZhaoyu commented Dec 25, 2018

By the way, below are the commands I used for compilation; they have been tested to be compatible with TF 1.12, g++ 4.9, and CUDA 9.0 without any changes to the source code.

For nvcc:

TF_INC=$(python -c 'import tensorflow as tf; print(tf.sysconfig.get_include())')
CUDA_HOME=/usr/local/cuda-9.0
nvcc -ccbin=/usr/bin/g++-4.9 -std=c++11 -c -o deform_conv.cu.o deform_conv.cu.cc \
    -I $TF_INC -I $TF_INC/external/nsync/public -I /usr/local/ \
    -D GOOGLE_CUDA=1 -x cu -Xcompiler -fPIC -L $CUDA_HOME/lib64/ \
    --expt-relaxed-constexpr -DNDEBUG -gencode arch=compute_61,code=sm_61

Note: You may need to change compute_61,code=sm_61 to match the compute capability of your GPU (one way to check it is shown below), and you may also need to adjust CUDA_HOME accordingly.
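
If you are unsure of your GPU's compute capability, one way to check it from the same TensorFlow install (TF 1.x) is:

from tensorflow.python.client import device_lib

for d in device_lib.list_local_devices():
    if d.device_type == 'GPU':
        print(d.physical_device_desc)  # includes "compute capability: X.Y"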

For g++:

TF_INC=$(python -c 'import tensorflow as tf; print(tf.sysconfig.get_include())')
TF_LIB=$(python -c 'import tensorflow as tf; print(tf.sysconfig.get_lib())')
CUDA_HOME=/usr/local/cuda-9.0
# The pip TensorFlow package may lack this header; copy it in if it is missing.
if [ ! -f $TF_INC/tensorflow/stream_executor/cuda/cuda_config.h ]; then
    cp ./cuda_config.h $TF_INC/tensorflow/stream_executor/cuda/
fi
g++-4.9 -std=c++11 -shared -o deform_conv.so deform_conv.cc deform_conv.cu.o \
    -D_GLIBCXX_USE_CXX11_ABI=0 -D GOOGLE_CUDA=1 -fPIC -Wfatal-errors \
    -I $TF_INC -I $CUDA_HOME/include \
    -L $CUDA_HOME/lib64 -lcudart -L $TF_LIB -ltensorflow_framework
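
If both steps succeed, the resulting deform_conv.so should load cleanly; a quick smoke test:

import tensorflow as tf

# If this load raises no error, the compilation and ABI flags above worked.
deform_conv_module = tf.load_op_library('./deform_conv.so')
print([name for name in dir(deform_conv_module) if not name.startswith('_')])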
