Difference Between Weights and Biases: Another way of Looking at Forward Propagation

What are Weights and Biases?

Consider the following forward propagation algorithm: $\vec{y_{n}}=\mathbf{W_n}^T \times \vec{y_{n-1}} + \vec{b_n}$ where $n$ is the index of the layer and $\vec{y_n}$ is the output of the $n^{th}$ layer, expressed as an $l_n \times 1$ vector ( $l_n$ is the number of neurons in the $n^{th}$ layer). $\mathbf{W_n}$ is an $l_{n-1} \times l_{n}$ matrix storing the weights of every connection between layers $n-1$ and $n$ , so it has to be transposed for the product to work out. $\vec{b_n}$ holds the biases of the $n^{th}$ layer, one per neuron, in the shape of $l_n\times1$ .
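To make the shapes concrete, here is a minimal sketch of one layer of the formula above in plain Python. The function name `forward_layer` and the example numbers are made up for illustration; nested lists stand in for the matrix and vectors.

```python
def forward_layer(W, b, y_prev):
    """Compute y_n = W^T * y_prev + b for a single layer.

    W is stored as l_{n-1} rows by l_n columns, so column j holds the
    weights feeding neuron j of layer n.
    """
    l_prev, l_n = len(W), len(W[0])
    assert len(y_prev) == l_prev and len(b) == l_n
    return [
        sum(W[i][j] * y_prev[i] for i in range(l_prev)) + b[j]
        for j in range(l_n)
    ]


# Hypothetical layer: 3 inputs (l_{n-1} = 3) feeding 2 neurons (l_n = 2).
W = [[1.0, 0.0],
     [0.0, 2.0],
     [1.0, 1.0]]
b = [0.5, -1.0]
y_prev = [1.0, 2.0, 3.0]

y = forward_layer(W, b, y_prev)
print(y)  # [4.5, 6.0]
```

Reading down column j of `W` gives the weights of neuron j, which is exactly why the formula needs the transpose.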

As one can see, both weights and biases are just adjustable and differentiable (thus trainable) parameters that contribute to the final result.

Why do we need both of them, and why are Biases Optional?

A neural network is essentially a better version of the perceptron model, in which the output of each neuron (perceptron) has a linear relation with its input rather than simply being a plain 0/1. (This relation is then passed through an activation function to make it non-linear, which will be discussed later.)

To create a linear relation, the easiest way is to scale the input by a certain coefficient $w$ and output the scaled input: $f(x)=w\times x$

This model works all right: even a single neuron can perfectly fit a linear function like $f(x)=m\times x$ , and certain non-linear relations can be fit when neurons work in layers (together with the activation functions discussed later).

However, this new neuron without a bias lacks a significant ability that even the perceptron has: it always fires regardless of the input (its output is a pure scaling of the input, pinned to 0 at $x=0$ ), and thus fails to fit functions like $y=mx+b$ with $b \neq 0$ . It is also impossible to switch off the output of a specific neuron at a chosen threshold value of the input. Even though adding more layers and neurons eases and hides this issue a lot, neural networks without biases are likely to perform worse than those with biases (given the same total number of layers and neurons).
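The limitation is easy to see on a toy example. The sketch below fits a single neuron to points from $y=2x+3$ , once without and once with a bias. To keep it short it uses the closed-form least-squares solution instead of gradient training, and the data and function names are invented for illustration.

```python
def fit_without_bias(xs, ys):
    """Least-squares w for the bias-free neuron f(x) = w * x."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


def fit_with_bias(xs, ys):
    """Least-squares (w, b) for the neuron f(x) = w * x + b."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    w = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    return w, mean_y - w * mean_x


xs = [0.0, 1.0, 2.0, 3.0]
ys = [2.0 * x + 3.0 for x in xs]  # target function: y = 2x + 3

w_only = fit_without_bias(xs, ys)
w, b = fit_with_bias(xs, ys)

# Worst-case absolute error of each fitted neuron on the training points.
err_no_bias = max(abs(w_only * x - y) for x, y in zip(xs, ys))
err_bias = max(abs(w * x + b - y) for x, y in zip(xs, ys))

print("without bias:", w_only, "max error", err_no_bias)
print("with bias:   ", w, b, "max error", err_bias)
```

Without a bias the best the neuron can do is tilt its line through the origin, so it is off by the full offset at $x=0$ ; with a bias it recovers the slope and the offset.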

In conclusion, biases supplement the weights to help a network better fit the pattern; they are not strictly necessary, but they help the network perform better.

Another way of writing the Forward Propagation

Interestingly, the forward propagation algorithm $\vec{y_{n}}=\mathbf{W_n}^T \times \vec{y_{n-1}} + 1 \times \vec{b_n}$ can also be written like this: $\vec{y_{n}}^T= \left[ \begin{array}{c} \vec{y_{n-1}} \\ 1 \end{array} \right]^T \times \left[ \begin{array}{c} \mathbf{W_n} \\ \vec{b_n}^T \end{array} \right]$ , which is $\vec{y_{n}}^T = \vec{y_{new_{n-1}}}^T \times \vec{W_{new}}$ . Here $\vec{y_{new_{n-1}}}$ is $\vec{y_{n-1}}$ with a constant 1 appended, and $\vec{W_{new}}$ is $\mathbf{W_n}$ with $\vec{b_n}^T$ appended as an extra row, so the bias becomes just one more row of weights. Rewriting the equation this way makes the gradient update really easy to write.
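A quick way to convince yourself the two forms agree is to compute both. This is a small sketch with made-up numbers: `forward_layer` is the original weights-plus-bias form, and `forward_augmented` (a hypothetical name) appends a 1 to the input and uses a matrix with the bias stacked as the last row.

```python
def forward_layer(W, b, y_prev):
    """Original form: y_n = W^T * y_prev + b."""
    return [
        sum(W[i][j] * y_prev[i] for i in range(len(W))) + b[j]
        for j in range(len(b))
    ]


def forward_augmented(W_new, y_prev):
    """Rewritten form: append a constant 1 to the input, then multiply."""
    y_new = y_prev + [1.0]
    return [
        sum(y_new[i] * W_new[i][j] for i in range(len(W_new)))
        for j in range(len(W_new[0]))
    ]


W = [[1.0, 0.0],
     [0.0, 2.0],
     [1.0, 1.0]]
b = [0.5, -1.0]
y_prev = [1.0, 2.0, 3.0]

# Stack the bias under the weights as one extra row.
W_new = W + [b]

y_plain = forward_layer(W, b, y_prev)
y_aug = forward_augmented(W_new, y_prev)
print(y_plain, y_aug)
```

The appended 1 multiplies the bias row, which is where the $1 \times \vec{b_n}$ term in the formula comes from.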

How to update them?

It's super easy after the rewrite: $\vec{W_{new}} =\vec{W_{new}}-\eta\frac{\partial Error}{\partial \vec{W_{new}}}$ , where $\eta$ is the learning rate (step size).
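Here is a rough sketch of that update for a single layer without an activation. Everything specific is an assumption made for the example: the error is taken to be half the squared difference to a target, the step is scaled by a learning rate `eta`, and the numbers are invented. With that error, the gradient with respect to entry (i, j) of the stacked matrix is the i-th augmented input times the j-th output error.

```python
def forward_augmented(W_new, y_prev):
    """Forward pass with the bias stored as the last row of W_new."""
    y_new = y_prev + [1.0]
    return [
        sum(y_new[i] * W_new[i][j] for i in range(len(W_new)))
        for j in range(len(W_new[0]))
    ]


def squared_error(y, target):
    """Assumed error function: 0.5 * sum of squared differences."""
    return 0.5 * sum((a - t) ** 2 for a, t in zip(y, target))


def gradient_step(W_new, y_prev, target, eta):
    """One update W_new <- W_new - eta * dError/dW_new."""
    y_new = y_prev + [1.0]
    y = forward_augmented(W_new, y_prev)
    delta = [a - t for a, t in zip(y, target)]  # dError/dy
    return [
        [W_new[i][j] - eta * y_new[i] * delta[j] for j in range(len(delta))]
        for i in range(len(y_new))
    ]


# Two inputs plus the bias row, two outputs, everything starting at zero.
W_new = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
y_prev = [1.0, 2.0]
target = [1.0, -1.0]
eta = 0.1  # assumed learning rate

errors = []
for _ in range(20):
    errors.append(squared_error(forward_augmented(W_new, y_prev), target))
    W_new = gradient_step(W_new, y_prev, target, eta)
errors.append(squared_error(forward_augmented(W_new, y_prev), target))

print(errors[0], errors[-1])
```

Weights and biases are updated by the same line of code here, since the bias is simply the last row of the matrix.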

The Activation Function

There is one more component yet to be mentioned: the activation function. It's basically a function that takes the raw output of a neuron as its input and outputs the value that becomes the final output of the neuron: $\vec{y_{n}} =Activation(\mathbf{W_n}^T \times \vec{y_{n-1}} + \vec{b_n})$ . There are many types of them around, but the ones used to build deep networks share at least one property: they are all non-linear!

That's basically what they are designed for. Activation functions pass the output through a non-linear function, thus introducing non-linearity into the model.

Consider problems that are not linearly separable, like the XOR problem: giving the network the ability to draw non-linear separators is what makes the classification possible.
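As an illustration, here is a tiny two-layer network that computes XOR. The weights are picked by hand rather than trained, and ReLU is just one possible choice of activation; both are assumptions made for this sketch. Swapping the activation for the identity function shows what is lost without the non-linearity.

```python
def relu(z):
    """A common non-linear activation: max(0, z)."""
    return max(0.0, z)


def identity(z):
    """No activation at all, so the whole network stays linear."""
    return z


def xor_net(x1, x2, activation):
    """Two hidden neurons and one output neuron with hand-picked weights."""
    h1 = activation(1.0 * x1 + 1.0 * x2 + 0.0)
    h2 = activation(1.0 * x1 + 1.0 * x2 - 1.0)
    return 1.0 * h1 - 2.0 * h2


cases = [(0, 0), (0, 1), (1, 0), (1, 1)]
with_relu = [xor_net(a, b, relu) for a, b in cases]
without = [xor_net(a, b, identity) for a, b in cases]

print(with_relu)  # [0.0, 1.0, 1.0, 0.0]
print(without)    # [2.0, 1.0, 1.0, 0.0]
```

With ReLU the outputs match XOR exactly. With the identity the same weights collapse into the single linear function $2 - x_1 - x_2$ , which cannot produce the XOR pattern.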

Also, some activation functions serve another purpose, which is to squash an arbitrarily large input into a bounded range (between -1 and 1 for tanh, for example), thus keeping the values passed on to later layers at a manageable scale.
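The squashing is easy to see with a few numbers. This sketch assumes tanh as the activation, since the text does not name a specific one; other activations have different ranges, and some, such as ReLU, are not bounded above at all.

```python
import math

# Raw neuron outputs, from very negative to very positive.
inputs = [-1000.0, -3.0, 0.0, 3.0, 1000.0]

# tanh maps every one of them into the range from -1 to 1.
squashed = [math.tanh(z) for z in inputs]

for z, s in zip(inputs, squashed):
    print(z, "->", s)
```

However large the input gets, the output never leaves the range from -1 to 1, so the next layer always sees values of a similar scale.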

2017/10/15