GitHub: https://github.com/maximelianos/cudacnn
Neural encoder-decoder image denoiser written in C++/CUDA completely from scratch, with support for data in (batch, layer, height, width) layout.
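For reference, a flat-array index into this (batch, layer, height, width) layout can be computed as in the sketch below; the function and parameter names are illustrative, not taken from the repository.

```
// Sketch: flat index into a tensor stored in (batch, layer/channel, height, width) order.
// b = batch index, c = channel (layer) index, y = row, x = column.
inline int tensor_index(int b, int c, int y, int x, int channels, int height, int width)
{
    return ((b * channels + c) * height + y) * width + x;
}
```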
The model is trained with Python Keras, the weights are exported in text format, and inference runs in C++.
Developed with the CUDA API on Ubuntu 14.04 x86-64, CUDA Toolkit v8.0, gcc v4.8.4, and an NVIDIA GeForce GTX 770M GPU (compute capability 3.0). Set the compute capability of your GPU in the Makefile.
```
make
./denoiser --benchmark <n> --model <model.txt> <image.png>
```
Options
--benchmark <n>
- run model n times on GPU, show average run time
--model <model.txt>
- load model weights from file
<image.png>
- image to be processed. Output image path is image_denoised.png
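For example: ./denoiser --benchmark 10 --model model2.txt image.png (model2.txt is the weights file included in this repository; the input image name is only an illustration).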
Remove generated files: make distclean
src/
- C++/CUDA program
autoencoder.ipynb
- Python notebook for training the model in Keras (original notebook)
model2.txt
- all model weights in text format
desc2.txt
- model architecture
Several optimizations were attempted:
- Fuse the convolution and activation layers into one CUDA kernel (conv+act); a sketch is shown after this list
- Execute the convolution channels of one layer in parallel and recompute the grid size for each layer (effective threads)
- Reduce the thread block size from 8 to 4 (block=4)
- Copy the input features into shared memory (shared mem)
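The following is a minimal sketch of the fused 3x3 convolution + activation kernel (conv+act), with one thread per output pixel of one output channel. The kernel name, the ReLU activation, the zero padding, and the memory layouts are assumptions for illustration, not the repository's actual code.

```
// Sketch: 3x3 convolution fused with ReLU for one layer.
// in:   input features, shape (in_c, h, w), channel-major
// w3:   weights, shape (3, 3, in_c, out_c) as described below
// out:  output features, shape (out_c, h, w)
// One thread computes one output pixel of one output channel.
__global__ void conv3x3_relu(const float *in, const float *w3, const float *bias,
                             float *out, int in_c, int out_c, int h, int w)
{
    int x  = blockIdx.x * blockDim.x + threadIdx.x;  // output column
    int y  = blockIdx.y * blockDim.y + threadIdx.y;  // output row
    int oc = blockIdx.z;                             // output channel
    if (x >= w || y >= h || oc >= out_c) return;     // threads outside the image do nothing

    float acc = bias[oc];
    for (int ic = 0; ic < in_c; ++ic)
        for (int ky = 0; ky < 3; ++ky)
            for (int kx = 0; kx < 3; ++kx) {
                int iy = y + ky - 1;                 // "same" padding with zeros
                int ix = x + kx - 1;
                if (iy < 0 || iy >= h || ix < 0 || ix >= w) continue;
                acc += in[(ic * h + iy) * w + ix]
                     * w3[((ky * 3 + kx) * in_c + ic) * out_c + oc];
            }
    out[(oc * h + y) * w + x] = acc > 0.0f ? acc : 0.0f;  // fused ReLU activation
}
```

Fusing the activation into the convolution kernel saves one kernel launch and one full read/write of the feature map per layer, which is why the two are treated as a single optimization above.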
All convolutions use a 3x3 kernel. The convolution weights of one layer have shape (3, 3, input_c, output_c), where input_c is the number of channels in the previous feature map and output_c is the number of channels the layer produces. The baseline implementation uses separate convolution and activation kernels, parallelizes threads only over the output feature height and width, and does not adjust the thread grid between layers even though the layers have different dimensions. Fusing convolution and activation does not improve speed much. Recomputing the grid size per layer together with channel-wise parallelism removes most of the idle threads and allows more threads to execute at once; small images occupy the GPU with relatively few threads, so the speedup there is larger. Reducing the thread block size leaves fewer threads that fall outside the image bounds. Shared memory is hard to implement correctly, and the synchronization it requires between threads may reduce performance.
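To illustrate the effective-threads and block=4 ideas, the host-side launch configuration can be recomputed from each layer's dimensions, with output channels mapped to the z dimension of the grid. Again this is a sketch under assumed names (and it assumes the block size refers to each block dimension), not the repository's code.

```
// Sketch: per-layer launch configuration with channel-wise parallelism.
void launch_layer(const float *in, const float *w3, const float *bias, float *out,
                  int in_c, int out_c, int h, int w)
{
    dim3 block(4, 4);                           // reduced block size (block=4)
    dim3 grid((w + block.x - 1) / block.x,      // recomputed for this layer's width
              (h + block.y - 1) / block.y,      // ... and height (effective threads)
              out_c);                           // one grid slice per output channel
    conv3x3_relu<<<grid, block>>>(in, w3, bias, out, in_c, out_c, h, w);
}
```

With the grid derived from the actual layer size, only the threads in the last partial blocks fall outside the image, and the z dimension lets all output channels of a layer run concurrently instead of sequentially.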