"My understanding is that [8qg neural nets alrea..."

https://arbital.com/p/8qf

by Travis Rivera Oct 7 2017


However, this sounds a little strange and counter-intuitive: do we really need to map everything into such a high-dimensional space just to classify 10 different digits? Neural networks seem to be, in some sense, a mimic of the brain, but my brain (at least mine) does not seem to rely on almost a thousand discrete features of a specific digit when recognizing it, not to mention that the size of digits in real life can vary vastly due to factors like distance. Do we really need all these features to perform the classification, or can we first extract fewer but more pivotal features from the raw image?

My understanding is that neural nets already determine the key features that are important to the decision. The importance of a given feature is represented by the weight on a particular neuron/input-feature.

So no, we don't need every feature. We just need the features relevant to the decision, and some amount of pre-processing can definitely help.
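As a minimal sketch of that pre-processing idea (an illustration added here, not part of the original exchange), one can project the 64 raw pixel features of scikit-learn's small 8x8 digits set down to 16 principal components before classifying:

```python
# Minimal sketch: reduce raw pixels to a few "pivotal" features with PCA,
# then classify. Uses scikit-learn's small 8x8 digits set (64 raw pixels).
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Keep only 16 principal components instead of all 64 pixel features.
clf = make_pipeline(PCA(n_components=16), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)
print("accuracy with 16 features:", clf.score(X_test, y_test))
```

On this toy set, the reduced representation typically loses little accuracy, which is the sense in which most raw features are not individually needed.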


Comments

Alto Clef

You are right.

In contrast, without manual optimization, a big problem is that neural networks learn features specific to the training set (like the position of the digit) that do not apply to the test set.

This makes regular DNNs really prone to position shifts. The CNN model, in my understanding, is more of a workaround: it manually tells the network to separate two kinds of information from the raw pixels--the actual features and their locations--as feature maps, preventing the network from being confused when features shift between locations.
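A minimal PyTorch sketch of that point (an illustrative toy added here, not a model from this discussion): convolution applies the same small filters at every position, so a stroke feature is detected wherever it appears, and the feature maps record where it fired.

```python
# Toy CNN: the same filters are slid over every position, so features are
# found regardless of where the digit sits; pooling then discards some of
# the "where" information while keeping the "what".
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    def __init__(self, n_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=3, padding=1),   # 8 feature maps
            nn.ReLU(),
            nn.MaxPool2d(2),          # pooling tolerates small position shifts
            nn.Conv2d(8, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # global pooling: keep "what", drop "where"
        )
        self.classifier = nn.Linear(16, n_classes)

    def forward(self, x):
        h = self.features(x)
        return self.classifier(h.flatten(1))

# A shifted input activates the same filters, just at shifted positions.
x = torch.randn(1, 1, 28, 28)
print(TinyCNN()(x).shape)  # torch.Size([1, 10])
```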

However, this also means CNNs assume that all the features are roughly the size of the receptive field, which makes them prone to shape/size shifts.

If my understanding above is correct, maybe it's possible to design a model like CNNs that instead manually tells the network to separate three kinds of information from the raw pixels--features, locations, and sizes. That way it might be possible to create an architecture resistant to size/shape/location shifts.
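One crude baseline in that direction (a sketch only, not the architecture proposed above) is to make size an explicit variable by running the same feature extractor over several rescaled copies of the input; the function and scales here are illustrative assumptions.

```python
# Hedged sketch: rescale the input to several sizes and apply the same
# extractor to each, so "size" becomes an explicit index rather than
# something the extractor has to be invariant to on its own.
import numpy as np
from scipy.ndimage import zoom

def multiscale_responses(image, extractor, scales=(0.5, 1.0, 2.0)):
    """Apply `extractor` (any function image -> score) at several scales."""
    responses = {}
    for s in scales:
        rescaled = zoom(image, s, order=1)  # bilinear resize
        responses[s] = extractor(rescaled)
    return responses

# Toy usage: the "extractor" here is just mean intensity, as a stand-in.
img = np.random.rand(28, 28)
print(multiscale_responses(img, extractor=lambda im: float(im.mean())))
```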

As far as I can tell, one possible way of doing that is, instead of treating a picture as raw pixels (as in a DNN) or raw feature maps (as in a CNN), treating the picture as vectors or even Bézier curves, so that the features extracted, such as the number of closed areas, no longer depend on any of the aforementioned shifts. However, the actual way of doing it is still something I am experimenting with.
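As a concrete illustration of one such shift- and scale-invariant feature (a sketch added here; the vector/Bézier representation itself is not attempted), the number of closed areas in a binarized digit can be counted with connected-component labelling.

```python
# Count closed areas (holes) in a binarized digit: e.g. "0" -> 1 hole,
# "8" -> 2 holes, "1" -> 0 holes. The count does not change if the digit
# is shifted or uniformly rescaled.
import numpy as np
from scipy.ndimage import label

def count_holes(binary_digit):
    """binary_digit: 2D bool array, True where the stroke is drawn."""
    # Label the connected background regions; any background region that
    # does not touch the image border is an enclosed hole.
    background, n_regions = label(~binary_digit)
    border_labels = set(np.unique(np.concatenate([
        background[0, :], background[-1, :],
        background[:, 0], background[:, -1],
    ])))
    border_labels.discard(0)  # 0 marks the stroke, not a background region
    return n_regions - len(border_labels)

# Toy "zero": a hollow square has one enclosed hole regardless of its size.
digit = np.zeros((10, 10), dtype=bool)
digit[2:8, 2:8] = True
digit[3:7, 3:7] = False
print(count_holes(digit))  # 1
```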

The above are just naive thoughts from a beginner in machine learning, and I can't help but want to express them. If there are any errors, and/or there already exist mature architectures that fit my description above, please let me know so I can improve. Thanks a lot : ).