Universität Bonn, Institute for Computer Science VI: Autonomous Intelligent Systems

Hannes Schulz (Staff member and PhD student)

GPU convolutions for neural networks

19 Jan 2012. Tagged with: cuv, gpu, benchmark

With all the popularity of deep learning, many researchers in the field might wonder which framework is the "right" one for implementing their experiments. For plain neural networks, the main "workhorse" is matrix multiplication, which can be accelerated a lot using graphics processing units (GPUs). For convolutional architectures, the matrix multiplication is typically "replaced" by a convolution, and we would like convolutions to be fast(er) on the GPU as well.

Neural net convolutions are somewhat special, since their filters are 3D and pool over input layers. Also, since they are usually applied to many small "maps" at once, common FFT acceleration techniques do not apply.
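
To make the "3D filter" point concrete, here is a minimal pure-Python sketch of a forward pass in which every filter sums over all input maps. It is only an illustration and not the code benchmarked in this post: the function name, the nested-list layout, and the choice of a "valid" cross-correlation (no filter flipping, no padding) are my assumptions.

```python
def naive_conv_forward(images, filters):
    """Naive forward pass with 3D filters (illustrative sketch only).

    images:  nested lists indexed [image][channel][row][col]
    filters: nested lists indexed [filter][channel][row][col]
    returns: nested lists indexed [image][filter][row][col]

    Assumption: "valid" cross-correlation, i.e. no padding and no flipping.
    """
    n_channels = len(images[0])
    img_h, img_w = len(images[0][0]), len(images[0][0][0])
    flt_h, flt_w = len(filters[0][0]), len(filters[0][0][0])
    out_h, out_w = img_h - flt_h + 1, img_w - flt_w + 1

    output = []
    for image in images:
        maps = []
        for flt in filters:
            out_map = [[0.0] * out_w for _ in range(out_h)]
            for y in range(out_h):
                for x in range(out_w):
                    acc = 0.0
                    # The filter is 3D: it sums over all input channels.
                    for c in range(n_channels):
                        for dy in range(flt_h):
                            for dx in range(flt_w):
                                acc += image[c][y + dy][x + dx] * flt[c][dy][dx]
                    out_map[y][x] = acc
            maps.append(out_map)
        output.append(maps)
    return output
```

Each output map is a 2D image, but every one of its pixels depends on a window across all input maps, which is why the filter count and the channel count both appear in the filter size.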

For my own implementations, I compared 3 convolution implementations:

  • The convolutions that come with Theano (from git, 2011-1-14). This implementation is by far the most flexible, as we will see. It is based on the formerly separate, now Theano-integrated CudaNdarray library.
  • Alex Krizhevsky, a PhD student in Toronto, wrote two publicly available convolution routines. We already integrated the first version of his convolutions in CUV.
  • Alex' new convolutions created for the cuda-convnet (svn, 2011-1-13) which are described as being "several times faster" than the first version.

Constraints

The (main) constraints of the three versions are quite different:

Implementation | Image Size  | Memory-Ordering (row-major)          | Other
Theano         | any         | (nImages, nChannels, imageH, imageW) |
Alex old       | square only | (nChannels, nImages, imageH*imageW)  | nFilters%2==0
Alex new       | square only | (nChannels, imageH*imageW, nImages)  | nFilters%16==0

Regarding the square-image constraint, one can argue that in random image collections the shapes vary anyway, and for batch processing it is necessary to make them square.
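
The constraints listed above can be written down as a small check. This is a sketch for illustration only: the function and the labels "theano", "alex_old" and "alex_new" are hypothetical names of mine, not part of any of the three libraries, and the check covers only the constraints named in the table.

```python
def check_constraints(implementation, image_h, image_w, n_filters):
    """Return a list of violated constraints (empty if the setting is usable).

    `implementation` is one of the hypothetical labels
    "theano", "alex_old", "alex_new".
    """
    # Required divisor of the filter count, taken from the table above.
    required_multiple = {"theano": 1, "alex_old": 2, "alex_new": 16}[implementation]

    problems = []
    if implementation in ("alex_old", "alex_new") and image_h != image_w:
        problems.append("images must be square")
    if n_filters % required_multiple != 0:
        problems.append("nFilters must be a multiple of %d" % required_multiple)
    return problems
```

For example, `check_constraints("alex_new", 176, 176, 32)` returns an empty list, so a setting with square 176x176 images and 32 filters satisfies the listed constraints of the new convolutions.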

The ordering is tricky. At first sight, Theano's ordering looks the most intuitive. However, all operations which are functions of all channels of a single pixel are a bit tricky to optimize with it. Alex' old and new orderings can both use efficient matrix-row operations for cross-channel functions. The "Alex old" convolution has the disadvantage that the images of one batch are neither in the columns nor in the rows of a matrix, so that final "full" layers (for example in LeNet) require reordering the matrix. The new convolutions have images in the columns of a matrix, which solves the reordering problem, even though this ordering looks the most unintuitive.
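
The three orderings can be compared by writing down where a single value ends up in a flat, row-major buffer. The following sketch uses hypothetical helper names of mine, and it assumes that the imageH*imageW axis is itself row-major (pixel index y*w + x), which the libraries may or may not do.

```python
def offset_theano(n, c, y, x, n_images, n_channels, h, w):
    # Ordering (nImages, nChannels, imageH, imageW)
    return ((n * n_channels + c) * h + y) * w + x


def offset_alex_old(n, c, y, x, n_images, n_channels, h, w):
    # Ordering (nChannels, nImages, imageH*imageW)
    # Assumption: pixel index within an image is y * w + x.
    return (c * n_images + n) * (h * w) + (y * w + x)


def offset_alex_new(n, c, y, x, n_images, n_channels, h, w):
    # Ordering (nChannels, imageH*imageW, nImages)
    # Assumption: pixel index within an image is y * w + x.
    return (c * (h * w) + (y * w + x)) * n_images + n


dims = dict(n_images=32, n_channels=8, h=176, w=176)

# Distance in memory between the same pixel of image 0 and image 1.
stride_theano = offset_theano(1, 0, 0, 0, **dims) - offset_theano(0, 0, 0, 0, **dims)
stride_alex_new = offset_alex_new(1, 0, 0, 0, **dims) - offset_alex_new(0, 0, 0, 0, **dims)
```

In the Theano ordering an image is one contiguous row of an nImages x (nChannels*imageH*imageW) matrix, while in the "Alex new" ordering the same image is one column of the transposed matrix, so neighbouring images are adjacent in memory. In the "Alex old" ordering the channels of one image are separated by the data of all other images, which is why it is neither a row nor a column.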

I should also mention the "sparse" filter option in Alex' code, which allows convolving only certain maps with a filter. I will not go into detail, since Theano does not have this feature and I want to compare execution times.

Speed

In the following table, all operations were computed 10 times and the (wall clock) times averaged. For Theano, I varied the 'version' parameter, but found that the auto-selection (-1) selects the best algorithm. I used a GTX480 GPU and an Intel Xeon X5650 CPU (2.67 GHz).
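
The averaging scheme can be sketched as follows. This is not the benchmark script used for the table; the function name is hypothetical, and for GPU code `operation` would additionally have to block until the kernel has finished (an assumption about how one would time asynchronous calls, which is not detailed here).

```python
import time


def average_wall_time_ms(operation, repeats=10):
    """Call operation() `repeats` times and return the mean wall-clock time in ms."""
    start = time.perf_counter()
    for _ in range(repeats):
        operation()  # assumed to return only when the work is really done
    elapsed_seconds = time.perf_counter() - start
    return 1000.0 * elapsed_seconds / repeats
```

Timing the whole loop once, instead of each call separately, keeps the overhead of the timer itself out of the measurement.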

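Execution speed of convolution packages

Version   | Image Size   | Filter Size | Type  | Time (ms) | Comment
Naive CPU | 32,8,176,176 | 32,8,7,7    | fwd   | 34200     |
          |              |             | dimg  | 26800     |
          |              |             | dflt  | n/a       |
Alex new  | 32,8,176,176 | 32,8,7,7    | fwd   | 75        |
          |              |             | dimg  | 90        |
          |              |             | dflt  | 55        |
          |              |             | trn   | 0.3       | transposing all input batch
          |              |             | total | 220.3     |
Alex old  | 32,8,176,176 | 32,8,7,7    | fwd   | 101       |
          |              |             | dimg  | 240       | plus error padding (3 ms)
          |              |             | dflt  | 115       | plus summing over batch (.8 ms)
          |              |             | total | 459       |
Theano    | 32,8,176,176 | 32,8,7,7    | fwd   | 268       |
          |              |             | dimg  | 451       |
          |              |             | dflt  | 281       |
          |              |             | total | 1000      |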

Key:

  • Image Size: batch size, number of input maps, height, width
  • Filter Size: number of output maps, number of input maps, height, width
  • Type: fwd is the "forward pass" convolution, dimg is the derivative w.r.t. the inputs and dflt is the derivative w.r.t. the filters.

Discussion: I was quite surprised to see that Theano is comparatively slow. Alex' new convolutions are indeed faster than his old ones, albeit not several times faster for the tested case (Update: with patches for small batch sizes kindly provided by Alex, speed nearly doubled!). The overhead of a transpose (to comply with the "weird" memory layout) is negligible compared to the overall advantage. All GPU implementations significantly outperform a naive CPU version (just many nested for-loops). Note, however, that Theano is able to generate code for efficient CPU convolutions.
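
To put numbers on "faster, albeit not several times", the totals from the table can be turned into ratios relative to "Alex new". The snippet only redoes arithmetic on the measured values; the variable names are mine.

```python
# Total times in ms (fwd + dimg + dflt, plus extras), copied from the table.
total_ms = {"Alex new": 220.3, "Alex old": 459, "Theano": 1000}

# How many times slower each package is than "Alex new" for this setting.
slowdown = {
    name: time_ms / total_ms["Alex new"]
    for name, time_ms in total_ms.items()
}

for name in ("Alex old", "Theano"):
    print("%s: %.1fx the time of Alex new" % (name, slowdown[name]))
# prints:
# Alex old: 2.1x the time of Alex new
# Theano: 4.5x the time of Alex new
```

So for this image and filter size, the new convolutions are roughly twice as fast as the old ones, and the 0.3 ms transpose accounts for well under one percent of the 220.3 ms total.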

Combinations: Theano is quite flexible, but "Alex new" is fast. How do we get the best of both worlds? It is interesting to note that the memory layouts of the two convolutions are transposes of each other, and that for just 0.3 ms (in the above setting), we can get from one to the other. So we can choose speed or flexibility as needed.
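
That the two layouts are transposes of each other can be checked on a toy example. The sketch below labels every value with its (image, channel, row, column) index, writes the buffer once in Theano order and once in "Alex new" order, and converts between them with a plain matrix transpose. The helper name is hypothetical, and the row-major pixel order inside imageH*imageW is my assumption.

```python
def transpose_flat(buffer, n_rows, n_cols):
    """Transpose a row-major n_rows x n_cols matrix stored as a flat list."""
    return [buffer[r * n_cols + c] for c in range(n_cols) for r in range(n_rows)]


n_images, n_channels, h, w = 2, 3, 2, 2
values_per_image = n_channels * h * w

# Theano order (nImages, nChannels, imageH, imageW): one image per matrix row.
theano_buffer = [(n, c, y, x)
                 for n in range(n_images)
                 for c in range(n_channels)
                 for y in range(h)
                 for x in range(w)]

# "Alex new" order (nChannels, imageH*imageW, nImages): one image per column.
alex_new_buffer = [(n, c, y, x)
                   for c in range(n_channels)
                   for y in range(h)
                   for x in range(w)
                   for n in range(n_images)]

# View the Theano buffer as an nImages x (nChannels*imageH*imageW) matrix
# and transpose it; the result has the "Alex new" ordering.
converted = transpose_flat(theano_buffer, n_images, values_per_image)
layouts_match = converted == alex_new_buffer

# Transposing again gets us back to the Theano ordering.
round_trip_ok = transpose_flat(converted, values_per_image, n_images) == theano_buffer
```

On the GPU the same conversion is a single matrix-transpose kernel, which is where the 0.3 ms in the table comes from.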

Maintenance concerns

Neither implementation is particularly well documented, but both are well tested. At least for CudaNdarray, there is a successor on the way. It seems to me that optimized code at this level is mostly write-only anyway.

