Computer Vision (CNNs)
Identity Mappings in Deep Residual Networks (He, Zhang, Ren, Sun) [Read Sept 23rd]
- Idea: Passing more information forward in a network by establishing skip-connections can result in much deeper and more powerful networks. The authors of the original ResNets now expand on this idea by tweaking the connections a bit.
- (1) First they experiment with the type of skip-connections (as given by function h). They argue that the best type of skip connection is simply the identity function, largely because any other type of function (scalar multiplication, dropout, gate, conv, etc.) will result in vanishing or exploding gradients. A simpler way to see this is that any manipulations on the shortcuts can hamper information propagation and lead to optimization problems.
- (2) In a typical ResNet, the results of the Convs are added to the results of an identity output, and then a post-activation ReLU is applied before passing the outputs to the next block. In the proposed "pre-activation" process, there is no longer a ReLU after the addition. Instead, the identity results are allowed to flow through completely unimpeded, and a pre-processing of [Batch Norm]->[ReLU] is applied to the Conv portion before multiplying by the weights.
- The authors experimented with having only ReLU, rather than BN and ReLU, in the pre-activation step, as well as a number of other combinations before arriving at the final model, which yielded the best empirical results.
- Results: New pre-activation units achieve gains over previous state-of-the-art original ResNets of roughly 10%, from ~5.5 to ~4.8 on ImageNet. These changes also make it possible to train networks of over 1000 layers!
- Lessons: The authors believe this works well because first, the optimization is further eased since an identity mapping over a ReLU allows the signal to be passed freely backwards and forwards. Secondly, this works well because using BN as pre-activation improves regularization of the models.
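- Sketch: to make the ordering in (2) concrete, here is a toy comparison of the two unit types in plain Python. Everything in it is an assumption for illustration, not the paper's code: `normalize` stands in for Batch Norm (per-vector standardization, no learned parameters), `weight` stands in for a conv layer (elementwise scaling), and each residual branch is cut down to a single weight layer. `shortcut_gain` illustrates the argument in (1) that a constant scale on the shortcut compounds across depth.

```python
import math


def relu(v):
    return [max(0.0, x) for x in v]


def normalize(v, eps=1e-5):
    # Stand-in for Batch Norm: standardize a single vector.
    # No learned scale/shift and no batch statistics (assumption).
    mean = sum(v) / len(v)
    var = sum((x - mean) ** 2 for x in v) / len(v)
    return [(x - mean) / math.sqrt(var + eps) for x in v]


def weight(v, w):
    # Stand-in for a conv layer: elementwise scaling by hypothetical weights w.
    return [wi * xi for wi, xi in zip(w, v)]


def original_unit(x, w):
    # Original ordering (simplified): weight -> BN on the residual branch,
    # add the shortcut, then apply ReLU AFTER the addition.
    residual = normalize(weight(x, w))
    return relu([a + b for a, b in zip(x, residual)])


def preact_unit(x, w):
    # Pre-activation ordering: BN -> ReLU -> weight on the residual branch,
    # and nothing is applied after the addition.
    residual = weight(relu(normalize(x)), w)
    return [a + b for a, b in zip(x, residual)]


def shortcut_gain(lam, depth):
    # If every shortcut multiplies the signal by lam, the factor carried
    # across `depth` units is lam ** depth.
    return lam ** depth
```

With all-zero weights, `preact_unit` returns its input untouched, while `original_unit` still clips negative entries through the final ReLU. Likewise `shortcut_gain(0.9, 100)` collapses toward zero and `shortcut_gain(1.1, 100)` blows up, whereas a factor of exactly 1.0 (the identity) stays at 1.0.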
Densely Connected Convolutional Networks (Huang, Liu, Weinberger) [Read Sept 15th]
- Idea: When dealing with CNNs, we've seen that passing more information over directly through ResNets or Highway Networks allows the final softmax classifier to perform better even when the networks have many layers. The authors take this one step further such that each layer obtains additional inputs from all preceding layers and passes on its own feature maps to all subsequent layers. Crucially, in contrast to ResNets, the model does not combine features through summation before they are passed into a layer, and instead the model provides all the features as separate inputs.
- One counter-intuitive point is that in some cases, DenseNets actually require fewer parameters than traditional networks because each subsequent layer requires fewer feature maps to perform well, and fewer feature maps overall means fewer weights to optimize. This is made possible because all layers in a DenseNet have direct access to feature maps from all preceding layers, which means that many of the feature maps (namely those from the beginning) are being reused by layers towards the end. In turn, this reuse lowers the burden on future layers to re-learn redundant feature maps. More concretely, in traditional ConvNets, the number of maps per layer is a medium size number (~64) and the number of connections grows linearly. In a DenseNet, the number of connections grows much faster, but the maps per layer is a small size number (~12). Thus, even though the number of connections grows quadratically, DenseNet layers start with a smaller base, so depending on the final depth, the total number of parameters might actually end up smaller as well.
- Results: The DenseNet beats the state-of-the-art on CIFAR-10, CIFAR-100 and SVHN by a decent margin, with improvements around 30% over previous methods. When the depth of the DenseNet is lowered such that its total number of params is roughly equal to that of a ResNet, the DenseNet error is still consistently lower, by around 11%.
- Lessons: One potential side-effect of the more efficient use of parameters may be a tendency of DenseNets to be more robust against overfitting, which led to a lessened need for regularization. Dropout was only used for datasets without data augmentation, which are substantially smaller and thus more prone to overfitting. Separately, inspired by (He et al., 2016), composite layers that contain (Batch Norm)-(ReLU)-(ConvPool) seem to be great mechanisms for joining together network components.
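- Sketch: the parameter argument above can be checked with back-of-the-envelope arithmetic. This is a rough count under my own assumptions, not the paper's accounting: only 3x3 conv weights are counted (no biases, Batch Norm, or transition layers), the plain network keeps 64 maps per layer, the dense block adds 12 maps per layer, and the `initial_maps=16` starting width is a made-up value.

```python
def conv_params(in_maps, out_maps, kernel=3):
    # Weights in one conv layer, ignoring biases and Batch Norm parameters.
    return kernel * kernel * in_maps * out_maps


def plain_params(depth, maps=64):
    # Traditional ConvNet: every layer maps `maps` feature maps to `maps` maps.
    return sum(conv_params(maps, maps) for _ in range(depth))


def dense_block_params(depth, growth=12, initial_maps=16):
    # Dense block: layer i sees the initial maps plus the `growth` maps emitted
    # by each of the i earlier layers, and emits `growth` new maps itself.
    return sum(
        conv_params(initial_maps + i * growth, growth) for i in range(depth)
    )


def dense_connections(depth):
    # Direct connections in a dense block with `depth` layers: L(L+1)/2.
    return depth * (depth + 1) // 2


if __name__ == "__main__":
    for depth in (12, 100):
        print(depth, plain_params(depth), dense_block_params(depth))
```

Under these assumptions a 12-layer dense block needs roughly a quarter of the weights of a 12-layer plain stack, but at 100 layers the quadratic term wins and the dense block becomes the larger one, which is the "depending on the final depth" caveat.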
Deep Networks with Stochastic Depth (Huang, Sun, Liu, Sedra, Weinberger) [Read Sept 26th]
- Idea: The expressive power of deep ConvNets with 100s of layers leads to significant improvements in performance. However, this does not come without a cost, since deep networks often exhibit problems of diminishing feature reuse in feed forward calculations, vanishing gradients during backprop, and slower training time. To address these problems, the authors propose a stochastic depth procedure where the network size is randomly cut shorter during training time (and is the full length during test time). Concretely, each mini-batch has a chance of dropping a subset of layers and bypassing those layers with the identity function. Each layer is dropped based on the outcome of an independent Bernoulli random variable, whose distribution is determined by a linear decay rate such that later layers (closer to output) have a higher likelihood of being dropped than earlier layers (closer to input).
- Results: Recall VGG-net and GoogLeNet have 19 and 22 layers respectively, from 2014. ResNets pushed this to 152 layers last year, and now in 2016, 1000+ layers (yes, thousand) can be trained effectively. This yields test errors below 5% on CIFAR, which was state-of-the-art at the time of publication.
- Training time is no longer a function of total size, but rather total expected network size. Based on the proposed survival probability scheme, this amounts to ~25% time decrease over networks of the same constant size.
- Lessons: The random dropping of layers with stochastic depth functions a lot like the random dropping of units with dropout. Whereas the former operates on depth, the latter operates on width. Another notable similarity is that both seem to offer a form of regularization, as well as a makeshift ensemble of different network sizes.
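- Sketch: the linear decay rule and the "expected network size" point can be written down in a few lines. This is a minimal scalar sketch, not the authors' implementation: the final survival probability of 0.5 is an assumed setting, the residual functions are arbitrary callables, and the ReLU after each addition is left out to keep it short.

```python
import random


def survival_probs(num_blocks, final_prob=0.5):
    # Linear decay: early blocks are almost always kept, and the last block
    # survives with probability final_prob.
    return [
        1.0 - (l / num_blocks) * (1.0 - final_prob)
        for l in range(1, num_blocks + 1)
    ]


def expected_depth(probs):
    # Expected number of active blocks in one training pass.
    return sum(probs)


def forward(x, residual_fns, probs, training, rng=random):
    for f, p in zip(residual_fns, probs):
        if training:
            # One independent Bernoulli draw per block: keep it with
            # probability p, otherwise bypass it with the identity.
            if rng.random() < p:
                x = x + f(x)
        else:
            # Test time: every block is active, scaled by its survival prob.
            x = x + p * f(x)
    return x
```

For example, with 54 residual blocks and `final_prob=0.5`, `expected_depth(survival_probs(54))` comes out to 40.25, about three quarters of the full depth, which is in line with the ~25% training time decrease noted above.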
Bridging the Gaps Between Residual Learning, Recurrent Neural Networks and Visual Cortex (Liao, Poggio) [Read Oct 3rd]
- Idea: ResNets have recently shown great promise in image recognition, but often require 1000+ layers to achieve SOTA accuracy. While impressive, the visual cortex does this with far fewer processing stages. One advantage the brain uses is lateral connections in addition to depth connections, so the authors propose combining the ResNet (depth) with RNN-like structure (lateral) which would more closely mimic how the visual cortex operates.
- In particular, they reformulate the ResNet architecture to have shared weights across "time-steps", where one time-step is one pass through a ResNet block (BN-ReLU-Conv), representing the lateral connections. Next, they layer on multiple "RNN-ResNet" blocks in a hierarchical fashion to simulate the depth. This is essentially the form of a stacked-RNN. Finally, there is a pre-net and post-net for input preprocessing and post-processing, all trained jointly.
- Results: On the CIFAR-10 dataset, the new structure does not perform as well as the original ResNets, coming in roughly 3x worse (~6% loss vs ~2% loss). This lines up with having roughly 3x fewer params to train (100k vs 300k), because weights are shared. Those extra weights are clearly making a difference though, and can't just be tossed away.
- Remarks: While the biological inspiration is certainly interesting, the empirical results are lackluster. I'm sure a future architecture will find a way to combine all the hot ideas (CNNs, Skip-connections, GRUs, Batch Norm and let's throw in Dropout too), but we are not there yet.
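- Sketch: the weight-sharing reformulation is easy to see in a toy form. The code below is only an illustration under my own simplifications: the block is reduced to `h + relu(w * h)` with an elementwise weight vector standing in for BN-ReLU-Conv, and the function names are made up.

```python
def residual_step(h, w):
    # One toy residual update: identity shortcut plus a ReLU'd weighted branch.
    return [hi + max(0.0, wi * hi) for hi, wi in zip(h, w)]


def resnet_forward(h, per_layer_weights):
    # Ordinary ResNet: each layer owns its own weights.
    for w in per_layer_weights:
        h = residual_step(h, w)
    return h


def rnn_forward(h, shared_w, steps):
    # "RNN-ResNet": the same weights are reused at every time-step.
    for _ in range(steps):
        h = residual_step(h, shared_w)
    return h


def num_params(per_layer_weights):
    return sum(len(w) for w in per_layer_weights)
```

A 3-layer ResNet whose layers happen to share one weight vector computes exactly the same thing as 3 time-steps of the recurrent version, but the recurrent version only stores one copy: `num_params([w] * 3)` is three times `num_params([w])`.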
Visual Storytelling (Huang, Ferraro, Mostafazadeh, Misra, Agrawal, Devlin, Girshick, et al.) [Read Oct 4th]
- Ideas: Image recognition has arguably surpassed human accuracy, and yet there is still no sense that machines are better than humans at storytelling. In particular, while machines can now be trained to give literal descriptions of an image (sun is setting), they still struggle to give colorful narratives filled with emotion or intent (sky is illuminated with brilliance as the gentle sun drops beneath the horizon). In order to tackle this, the authors developed a public dataset of sequences of images that offer more context about how situations evolve and give machines a chance to reason about changes over time.
- The dataset was created using thousands of Mechanical Turk workers, with multiple stages of processing and quality control. Concretely, rather than having the workers simply label the images using the instruction "describe all the important parts", the new instructions ask the turkers to tell a story. The authors then establish a baseline by building a network that generates stories.
- Results: Of course, grading a story is quite difficult, so after producing stories, a panel of human judges evaluated each story depending on how much they agreed with the statement “If these were my photos, I would like using a story like this to share my experience with my friends”. Then the authors tried a couple of different automatic evaluation techniques to find which metric most closely correlated with the human judges' evaluations. To this end, the authors found that METEOR is the most suitable automatic evaluation metric. Smoothed-BLEU, one of the other metrics they tried, is an idea from C.Y. Lin and F.J. Och in 2004, from their paper Automatic Evaluation of Machine Translation Quality Using Longest Common Subsequence and Skip-bigram Statistics.
- Lessons: New dataset (SIND), great win for the community!
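- Sketch: picking the automatic metric that tracks human judgment best boils down to a correlation comparison. The snippet below is a generic illustration only: Pearson correlation is my choice of statistic, and the metric names and all numbers are made-up toy values, not results from the paper.

```python
import math


def pearson(xs, ys):
    # Pearson correlation coefficient between two equal-length sequences.
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def best_metric(metric_scores, human_ratings):
    # Return the metric whose per-story scores correlate most strongly
    # with the human ratings of the same stories.
    return max(
        metric_scores,
        key=lambda name: pearson(metric_scores[name], human_ratings),
    )


# Hypothetical toy data: five stories, one human rating each,
# and two automatic metrics scoring the same five stories.
human = [1.0, 2.0, 3.0, 4.0, 5.0]
scores = {
    "metric_a": [0.10, 0.20, 0.25, 0.40, 0.50],
    "metric_b": [0.30, 0.10, 0.40, 0.20, 0.35],
}
winner = best_metric(scores, human)
```

Here `metric_a` rises almost in lockstep with the human ratings while `metric_b` bounces around, so `metric_a` is the one that gets selected.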
End to End Learning for Self-Driving Cars (Bojarski, Testa, Dworakowski, Firner, Flepp, Goyal, Jackel, et al.) [Read Oct 10th]
- Idea: Memory Networks and CNNs trained end-to-end have shown great promise in a number of applications, so the authors from Nvidia tried applying this approach to the self-driving car problem. While straightforward, the idea is also quite ambitious because there are many more steps to driving a car than a task like image captioning, which is already quite difficult. In particular, performing such a task would require recognizing lane markers (left to right steering), recognizing obstacles (braking), ramping onto a highway (acceleration), recognizing signs (deceleration), different weather conditions, and much more!
- If done correctly though, the model would be quite a breakthrough because it could interact directly with the world, rather than interpreting its surroundings through a composition of human-selected signaling mechanisms.
- Results: "A small amount of training data from less than a hundred hours of driving was sufficient to train the car to operate in diverse conditions, on highways, local and residential roads in sunny, cloudy, and rainy conditions." Absolutely amazing, here's the video: https://drive.google.com/file/d/0B9raQzOpizn1TkRIa241ZnBEcjQ/view.
- There are still parts that can be improved, but clearly self-driving cars are going to be a real thing within the next year. For context, this paper was from April 2016.
Understanding deep learning requires rethinking generalization (Zhang, Bengio, Hardt, Recht, Vinyals) [Read Mar 24th]
- Idea: Neural networks have amazing power such that even given random labels or an image with random noise, the network can predict those classes with full accuracy on training data. Of course, the same model falls apart completely on the test data because all it has done is "create a hashmap" that memorized all the training data. It's not the model's fault that it didn't pick up any salient features in the image - there are no meaningful features to pick up! Thus, if the network is so successful at memorizing features, then how is it able to generalize under normal circumstances? It seems like a network (such as Inception or AlexNet) is thus able to learn useful features (such as lines or angles) when available, but will fall back to rote memorization when those signals aren't available. The final takeaway is that due to this divergence, classical statistical learning theory and regularization strategy cannot explain the outstanding generalization ability of deep networks.
- Remarks: This paper won ICLR best paper award, but, in my humble opinion, doesn't seem all that ground-breaking. In particular, even though I am convinced that traditional statistical methods do not apply here, the authors don't offer an alternative. Moreover, the result just doesn't seem that surprising. It would not be surprising to hear that kids can memorize random vocabulary for an exam. And those same kids are also able to learn vocabulary faster with the presence of useful features (for example, "bilateral" means two sided based on construction of prefix and suffix). The former task is harder and would take longer to do. Similarly, we see a CNN takes longer to converge when training on random noise. So basically, we're saying that a network can learn in multiple ways. Maybe what's amazing is that we can't explain how the network does that?
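- Sketch: the "hashmap" behaviour is easy to reproduce with a pure memorizer. To be clear about the assumptions: a 1-nearest-neighbour lookup is my stand-in for a network that has memorized its training set, the inputs are random Gaussian vectors rather than images, and none of this is the paper's experimental setup.

```python
import random


def make_data(n, dim, num_classes, rng):
    # Random inputs with labels drawn independently of the inputs,
    # so there is nothing meaningful to learn.
    points = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]
    labels = [rng.randrange(num_classes) for _ in range(n)]
    return points, labels


def nearest_label(query, points, labels):
    # Memorizer: return the label of the closest stored training point.
    best = min(
        range(len(points)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(query, points[i])),
    )
    return labels[best]


def accuracy(queries, targets, points, labels):
    hits = sum(
        nearest_label(q, points, labels) == t for q, t in zip(queries, targets)
    )
    return hits / len(queries)


rng = random.Random(0)
train_x, train_y = make_data(200, 8, 10, rng)
test_x, test_y = make_data(200, 8, 10, rng)

# Every training point is its own nearest neighbour, so training accuracy
# is perfect; on fresh random data the lookup has nothing useful to return.
train_acc = accuracy(train_x, train_y, train_x, train_y)
test_acc = accuracy(test_x, test_y, train_x, train_y)
```

The memorizer scores 100% on its training set with purely random labels, while its test accuracy sits near the 10% chance level for ten classes, mirroring the train/test gap described in the Idea bullet.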
FractalNet (Larsson et al., 2016)
GoogLeNet with the “Inception module” (Szegedy et al., 2015)
Highway Networks (Srivastava et al., 2015)
Ladder Networks (Rasmus et al., 2015)
Actual CV tasks (Gardner et al., 2015) (Gatys et al., 2015)