Summary of ImageNet Classification With Deep Convolutional Neural Networks

3 min read · Jun 9, 2022


This is a summary of the following article: ImageNet Classification with Deep Convolutional Neural Networks, written as a requirement for a Holberton School project.


Until recently, datasets of labeled images were relatively small — on the order of tens of thousands of images. Simple recognition tasks can be solved quite well with datasets of this size, especially if they are augmented with label-preserving transformations. But objects in realistic settings exhibit considerable variability, so to learn to recognize them it is necessary to use much larger training sets.

Convolutional neural networks (CNNs) constitute one class of models with the large learning capacity this requires [16, 11, 13, 18, 15, 22, 26]. Until now, however, CNNs have been prohibitively expensive to apply at large scale to high-resolution images. Recent datasets such as ImageNet contain enough labeled examples to train such models without severe overfitting.

For this work, the authors wrote a highly optimized GPU implementation of 2D convolution and all the other operations inherent in training convolutional neural networks. Their network contains a number of new features which improve its performance and reduce its training time. The size of their network made overfitting a significant problem, even with 1.2 million labeled training examples.


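They used ImageNet, a dataset of over 15 million labeled high-resolution images belonging to roughly 22,000 categories. The images were collected from the web and labeled by human labelers using Amazon’s Mechanical Turk crowd-sourcing tool. The network itself was trained on the ILSVRC subset of ImageNet, which has roughly 1.2 million training images in 1,000 categories.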

In their architecture, they used the ReLU nonlinearity for the CNN and trained it on multiple GPUs. They also used local response normalization and overlapping pooling. To reduce overfitting, they chose to use data augmentation and dropout.
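As a rough illustration of three of these building blocks, the sketch below implements ReLU, overlapping max pooling, and dropout in plain Python. The function names and the simplification to 1-D lists are assumptions made for readability; the actual network applies these operations to 2-D feature maps on GPUs. The pooling window of 3 with stride 2 and the dropout probability of 0.5 follow the values reported in the paper.

```python
import random

def relu(xs):
    """Rectified linear unit: max(0, x) applied element-wise."""
    return [max(0.0, x) for x in xs]

def max_pool_1d(xs, size=3, stride=2):
    """Max pooling over a 1-D list; windows overlap whenever stride < size."""
    return [max(xs[i:i + size]) for i in range(0, len(xs) - size + 1, stride)]

def dropout_train(xs, p=0.5, rng=random):
    """Training-time dropout: zero each activation with probability p."""
    return [0.0 if rng.random() < p else x for x in xs]

def dropout_test(xs, p=0.5):
    """Test-time counterpart: keep every unit but scale by the keep probability."""
    return [(1.0 - p) * x for x in xs]

# Toy input (made-up numbers, 1-D for simplicity).
activations = relu([-2.0, 1.0, 3.0, -1.0, 0.5, 4.0, -3.0])
# Windows cover indices [0:3], [2:5], [4:7], so neighbouring windows share an element.
pooled = max_pool_1d(activations)
```

Because the stride (2) is smaller than the window (3), each pooled value sees one input that its neighbour also sees, which is what "overlapping" refers to here.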

They trained their models using stochastic gradient descent with a batch size of 128 examples, momentum of 0.9, and weight decay of 0.0005. They found that this small amount of weight decay was important for the model to learn. In other words, weight decay here is not merely a regularizer: it reduces the model’s training error.
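To make the roles of momentum and weight decay concrete, here is a minimal sketch of one update step, following the rule given in the paper: the momentum variable v is updated as 0.9·v − 0.0005·lr·w − lr·grad, and the weight as w + v. The function name `sgd_step`, the scalar weight, and the toy quadratic loss are assumptions for illustration only; in the real network the gradient is averaged over a batch of 128 examples.

```python
def sgd_step(w, v, grad, lr=0.01, momentum=0.9, weight_decay=0.0005):
    """One update of the momentum variable v and the weight w (scalars here)."""
    v_new = momentum * v - weight_decay * lr * w - lr * grad
    w_new = w + v_new
    return w_new, v_new

# Toy problem (an assumption for illustration): minimize L(w) = (w - 3)^2.
w, v = 0.0, 0.0
for _ in range(500):
    grad = 2.0 * (w - 3.0)  # dL/dw for the toy loss
    w, v = sgd_step(w, v, grad)
# w ends up very close to 3, pulled slightly toward 0 by the weight decay term.
```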


Their network achieves top-1 and top-5 test set error rates of 37.5% and 17.0% on ILSVRC-2010. By comparison, the best performance achieved during the ILSVRC-2010 competition was 47.1% and 28.2%, with an approach that averages the predictions produced from six sparse-coding models trained on different features.
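For readers unfamiliar with these metrics, the top-k error is the fraction of test images whose correct label is not among the k labels the model considers most probable. The sketch below computes it in plain Python; the function name `top_k_error` and the scores are made up for illustration and are unrelated to the paper's results.

```python
def top_k_error(scores, labels, k):
    """Fraction of examples whose true label is not among the k highest-scoring classes."""
    misses = 0
    for row, label in zip(scores, labels):
        # Class indices sorted from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label not in ranked[:k]:
            misses += 1
    return misses / len(labels)

# Hypothetical scores for 3 examples over 4 classes (made-up numbers).
scores = [
    [0.1, 0.6, 0.2, 0.1],  # true class 1: correct at top-1
    [0.5, 0.1, 0.3, 0.1],  # true class 2: wrong at top-1, correct at top-2
    [0.4, 0.3, 0.2, 0.1],  # true class 3: wrong even at top-2
]
labels = [1, 2, 3]
top1 = top_k_error(scores, labels, k=1)
top2 = top_k_error(scores, labels, k=2)
```

The top-k error can only stay the same or shrink as k grows, which is why the reported top-5 figure is lower than the top-1 figure.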

Figure 3 shows the convolutional kernels learned by the network’s two data-connected layers. The network has learned a variety of frequency- and orientation-selective kernels, as well as various colored blobs.

Figure 4: (Left) Eight ILSVRC-2010 test images and the five labels considered most probable by their model. The correct labels are written under each image, and the probability assigned to the correct label is also shown with a red bar.

Even off-center objects, such as the mite, can be recognized by the net. Figure 4 (Right) shows five test set images and, for each, the six training set images most similar to it, where similarity is measured by the Euclidean distance between the feature vectors the images produce in the last 4096-dimensional hidden layer. They present the results for many more test images in the supplementary material.
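In the paper, two images count as similar when the feature vectors they produce in the last hidden layer are close in Euclidean distance. A minimal sketch of that kind of nearest-neighbour lookup follows; the helper names and the tiny 2-D vectors are assumptions for illustration (the real vectors have 4096 dimensions).

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest(query, train_features, n=6):
    """Indices of the n training vectors closest to the query vector."""
    order = sorted(range(len(train_features)),
                   key=lambda i: euclidean(query, train_features[i]))
    return order[:n]

# Hypothetical 2-D feature vectors standing in for hidden-layer activations.
train_features = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [0.9, 1.2]]
closest = nearest([1.0, 1.0], train_features, n=2)
```

Comparing learned features rather than raw pixels is what lets retrieved images look different at the pixel level while still showing the same kind of object.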


Their results show that a large, deep convolutional neural network is capable of achieving record-breaking results on a highly challenging dataset using purely supervised learning. They still have many orders of magnitude to go in order to match the infero-temporal pathway of the human visual system.

Personal Notes

It’s interesting to see convolution and pooling applied in practice, and satisfying to recognize what we’ve learned about reducing overfitting with dropout and data augmentation. Even though there are some concepts that I still need to understand properly, this article sheds light on them: it serves as a concrete example of how to use convolution on a complex problem in a realistic setting. I’m hoping to see more examples like this at Holberton School.