GPU convolutions for neural networks

With all the popularity of deep learning, many researchers in the field might wonder which framework is "right" for implementing their experiments. For plain neural networks, the main "workhorse" is matrix multiplication, which can be accelerated considerably using graphics processing units (GPUs). For convolutional architectures, the matrix multiplication is typically "replaced" by a convolution, and we would like these to be fast(er) on the GPU as well. Neural net convolutions are somewhat special, since their filters are 3D and pool over input layers. Also, since they are usually applied to many small "maps" at once, common FFT acceleration techniques do not apply.

For my own implementations, I compared 3 convolution implementations:
Constraints

The (main) constraints of the three versions are quite different:
Regarding square images, one can argue that in random image collections the shapes vary anyway, and for batch processing it is necessary to make them square.

The ordering is tricky. At first sight, Theano's ordering looks most intuitive. However, operations that are functions of all channels of a single pixel are a bit tricky to optimize with it. Alex' old and new orderings can both use efficient matrix-row operations for cross-channel functions. The "Alex old" convolution has the disadvantage that the images in one batch are neither in the columns nor in the rows of a matrix, so that final "full" layers (for example in LeNet) require reordering the matrix. The new convolutions have images in the columns of a matrix, which solves the reordering problem, even though this ordering looks the least intuitive.

I should also mention the "sparse" filter option in Alex' code, which allows convolving only certain maps with a filter. I'm not going into detail, since Theano does not have this feature and I want to compare execution times.

Speed

In the following table, all operations were computed 10 times and the (wall clock) times averaged. For Theano, I varied the 'version' parameter, but found that the auto-selection (-1) selects the best algorithm. I used a GTX480 and an Intel Xeon X5650 (2.67 GHz).
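The measurement procedure described above (each operation run 10 times, wall-clock times averaged) can be sketched in Python with NumPy. This is only an illustration, not the benchmark code used for the table: the function names `naive_conv` and `average_wall_time`, the (batch, channel, row, column) array layout, and the choice of a "valid" correlation without filter flipping are all assumptions made for the example. A plain nested-loop CPU convolution serves as the operation being timed.

```python
import time
import numpy as np


def naive_conv(images, filters):
    """'Valid' multi-channel correlation (no filter flip) with plain loops.

    Assumed layouts (illustrative only):
      images:  (batch, channels, height, width)
      filters: (n_filters, channels, filter_height, filter_width)
    Returns an array of shape
      (batch, n_filters, height - filter_height + 1, width - filter_width + 1).
    """
    b, c, h, w = images.shape
    nf, fc, fh, fw = filters.shape
    assert c == fc, "filters must span all input channels"
    out = np.zeros((b, nf, h - fh + 1, w - fw + 1))
    for n in range(b):                      # every image in the batch
        for f in range(nf):                 # every 3D filter
            for y in range(h - fh + 1):     # every output row
                for x in range(w - fw + 1): # every output column
                    # The filter is 3D: it sums over all input channels.
                    patch = images[n, :, y:y + fh, x:x + fw]
                    out[n, f, y, x] = np.sum(patch * filters[f])
    return out


def average_wall_time(fn, *args, repeats=10):
    """Run fn(*args) `repeats` times and return the mean wall-clock seconds."""
    start = time.perf_counter()
    for _ in range(repeats):
        fn(*args)
    return (time.perf_counter() - start) / repeats
```

Calling `average_wall_time(naive_conv, images, filters)` returns the mean time per call in seconds. When timing GPU code in the same way, the device has to be synchronized before the clock is read, because kernel launches return before the work has finished.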
Key:
Discussion: I was quite surprised to see that Theano is comparatively slow. It seems that Alex' new convolutions are indeed faster, albeit not several times faster (for the tested case). (Update: With patches for small batch sizes kindly provided by Alex, speed nearly doubled!) The overhead of a transpose (to comply with the "weird" memory layout) is negligible compared to the overall advantages. All GPU implementations significantly outperform a naive CPU version (just many nested for-loops). Note, however, that Theano is able to generate code for efficient CPU convolutions.

Combinations: Theano is quite flexible, but "Alex new" is fast. How do we get the best of both worlds? It is interesting to note that the memory layouts of the two convolutions are transposes of each other, and that for just 0.3 ms (in the above setting) we can get from one to the other. So we can have speed or flexibility as desired.

Maintenance concerns

Neither implementation is particularly well documented, but both are well tested. At least for CudaNdarray, there is a successor on the way. It seems to me that optimized code at this level is mostly write-only anyway.
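To make the layout transposition mentioned under "Combinations" concrete, here is a minimal NumPy sketch on the CPU. It assumes that the Theano-style ordering is (batch, channel, row, column) and that the images-in-columns ordering of "Alex new" is (channel, row, column, batch). These axis orders and both function names are illustrative assumptions, and the code says nothing about how the actual GPU transpose is implemented.

```python
import numpy as np


def batch_major_to_images_in_columns(x):
    """(batch, channel, row, col) -> (channel, row, col, batch).

    Viewed as a matrix, the input holds one image per row;
    the output holds one image per column.
    """
    b, c, h, w = x.shape
    as_matrix = x.reshape(b, c * h * w)             # one image per row
    transposed = np.ascontiguousarray(as_matrix.T)  # one image per column
    return transposed.reshape(c, h, w, b)


def images_in_columns_to_batch_major(x):
    """(channel, row, col, batch) -> (batch, channel, row, col)."""
    c, h, w, b = x.shape
    as_matrix = x.reshape(c * h * w, b)             # one image per column
    transposed = np.ascontiguousarray(as_matrix.T)  # one image per row
    return transposed.reshape(b, c, h, w)
```

Under these assumptions, a single 2D matrix transpose is all that separates the two layouts, which is why switching between them is cheap compared to the convolution itself.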