"We have the face oval, two eyes, a nose and a mouth. For a CNN, a mere presence of these objects can be a very strong indicator to consider that there is a face in the image. Orientational and relative spatial relationships between these components are not very important to a CNN."
^^^ What? I thought the opposite of this was the mainstream view. The promise of DL is to learn hierarchical models of your data: the network learns edge filters, then learns combinations of edge filters that differentiate an eye from a nose, but somehow doesn't learn combinations of intermediate features that determine a face? People usually say that with a deep enough network such a hierarchical concept can be learned...
Hinton is without doubt one of the greatest names in the field, but I think in this case in particular the paper fails to properly credit prior work. I have seen people publishing on dynamic routing and the invariance problem for decades (e.g. C. von der Malsburg, T. Poggio, and many others). But I admit, authors in general name concepts in such obscure and convoluted ways that it becomes very hard to separate contributions and give credit to those who deserve it.
To play with Capsule Networks in practice, you can try this simple Keras implementation: https://github.com/XifengGuo/CapsNet-Keras
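If it helps to see the core idea in a few lines, here is a minimal numpy sketch of the squash nonlinearity and the routing-by-agreement loop from the paper. The toy sizes, iteration count and random inputs are illustrative assumptions, not taken from that repo:

    import numpy as np

    def squash(v, axis=-1, eps=1e-8):
        # Squash nonlinearity: keeps the vector's orientation but maps its
        # length into [0, 1) so the length can act as a "presence" probability.
        sq_norm = np.sum(v ** 2, axis=axis, keepdims=True)
        return (sq_norm / (1.0 + sq_norm)) * v / np.sqrt(sq_norm + eps)

    def dynamic_routing(u_hat, num_iters=3):
        # u_hat: predictions from lower capsules for each higher capsule,
        # shape (num_lower, num_higher, dim_higher).
        num_lower, num_higher, _ = u_hat.shape
        b = np.zeros((num_lower, num_higher))  # routing logits
        for _ in range(num_iters):
            c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)  # coupling coefficients
            s = (c[..., None] * u_hat).sum(axis=0)  # weighted sum per higher capsule
            v = squash(s)                           # output vectors
            b = b + (u_hat * v[None]).sum(axis=-1)  # reward lower capsules whose predictions agree
        return v

    # Toy example: 6 lower capsules routing into 2 higher capsules of dimension 4.
    u_hat = np.random.randn(6, 2, 4)
    print(dynamic_routing(u_hat).shape)  # (2, 4)

The point of the loop is that a lower-level capsule ends up sending its output mostly to whichever higher-level capsule its prediction agrees with, instead of being max-pooled away.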
Someone please explain this to me: I get that CNNs are unfit for learning different poses, rotations and such. However, I don't get the face example at all.
Let's say there is a layer of neurons that, after some convolution and pooling, produces features like "noseness", "eyeness" and "mouthness". Unless the pool size in the pooling layer was big enough to include the whole face in a single pool, the parts are still spatially separate (although at a lower resolution than in the original image).
In the next convolution layer, isn't the network going to learn that kernels with eyeness at the top, noseness in the middle and mouthness at the bottom are the most face-like, similarly to how the earlier layers learned to identify the parts? (Rough sketch of what I mean below.)
Am I missing something, or is it just a bad example?
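To make the question concrete, here is a toy numpy sketch. The part maps and the hand-built "face" kernel are entirely made up for illustration, nothing from the paper:

    import numpy as np

    # Toy 6x6 "part" maps after some conv+pool stage: channel 0 = eyeness,
    # channel 1 = noseness, channel 2 = mouthness, laid out like a face.
    parts = np.zeros((3, 6, 6))
    parts[0, 1, 2] = parts[0, 1, 4] = 1.0   # two eyes near the top
    parts[1, 3, 3] = 1.0                    # nose in the middle
    parts[2, 5, 3] = 1.0                    # mouth at the bottom

    # A hand-built kernel for the next conv layer: it rewards eyeness in the
    # top rows, noseness in the middle rows and mouthness in the bottom rows.
    face_kernel = np.zeros((3, 6, 6))
    face_kernel[0, 0:2, :] = 1.0
    face_kernel[1, 2:4, :] = 1.0
    face_kernel[2, 4:6, :] = 1.0

    def conv_response(feature_maps, kernel):
        # A valid convolution with a kernel as large as the whole map is just
        # one dot product: how strongly this spatial arrangement is present.
        return float(np.sum(feature_maps * kernel))

    print(conv_response(parts, face_kernel))              # 4.0: face-like layout
    print(conv_response(parts[:, ::-1, :], face_kernel))  # 1.0: parts flipped upside down

If a plain conv layer can learn a kernel like this, what exactly is the quoted claim saying it cannot do?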
For once I'd like to see a writeup on how these capsules work (or don't.....) in non-CNNs.
MNIST is too complex: they still relied on a conv layer for the experiment and barely explained why it is robust. Others are also struggling with why vector activations are better than scalar ones. Somebody needs to make a 2D XOR classification example.
I’m eagerly waiting for someone to use the ImageNet data on it and show some preliminary results.
At some point Hinton will bag a Nobel or equivalent for his contributions to machine learning.
tl;dr: deep learning needs data to learn invariances; "capsules" build in invariance to 3D rotation... somehow...
Is this going to pass the underwear stage, stretching CNNs to fit the Emperor's new clothes? Hinton's caught up in a mind-boggling batch of mumbo-jumbo that proves no one understands how the brain actually works, or is even close. Winter is nigh. Until then it's pin-the-tail-on-the-donkey time...