# Difference Between Weights and Biases: Another Way of Looking at Forward Propagation

https://arbital.com/p/8r4

by Alto Clef Oct 15 2017 updated Oct 15 2017

My understanding of forward propagation

## What are Weights and Biases

Consider the following forward propagation algorithm: $$\vec{y_{n}}=\mathbf{W_n}^T \times \vec{y_{n-1}} + \vec{b_n}$$ where $n$ is the index of the layer, and $\vec{y_n}$ is the output of the $n^{th}$ layer, expressed as an $l_n \times 1$ vector ($l_n$ is the number of neurons in the $n^{th}$ layer). $\mathbf{W_n}$ is an $l_{n-1} \times l_{n}$ matrix storing the weights of every connection between layers $n-1$ and $n$, and therefore needs to be transposed for the product to be well-defined. $\vec{b_n}$, likewise, holds the biases applied between the $(n-1)^{th}$ and $n^{th}$ layers, in the shape $l_n\times1$.

As one can see, both weights and biases are just adjustable and differentiable (thus trainable) factors that contribute to the final result.
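To make the shapes concrete, here is a minimal numpy sketch of one such forward step (the layer sizes of 3 and 2 are just made-up values for illustration):

```python
import numpy as np

# Made-up layer sizes: layer n-1 has 3 neurons, layer n has 2.
l_prev, l_n = 3, 2

y_prev = np.random.rand(l_prev, 1)   # output of layer n-1, shape (3, 1)
W_n = np.random.rand(l_prev, l_n)    # weights between the two layers, shape (3, 2)
b_n = np.random.rand(l_n, 1)         # biases of layer n, shape (2, 1)

# Forward propagation: y_n = W_n^T . y_{n-1} + b_n
y_n = W_n.T @ y_prev + b_n
print(y_n.shape)                     # (2, 1)
```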

## Why do we need both of them, and why are Biases Optional?

A neural network is, in essence, an improved version of the perceptron model: the output of each neuron (perceptron) has a linear relation with its input, rather than being a plain 0/1. (This value is then passed through an activation function to make it non-linear, which is discussed later.)

To create such a linear relation, the easiest way is to scale the input by a coefficient $w$ and output the scaled value: $$f(x)=w\times x$$

This model works reasonably well: even a single neuron can perfectly fit a linear function like $f(x)=m\times x$, and certain non-linear relations can be fit by neurons working together in layers.

However, this new neuron without biases lacks a significant ability that even the perceptron has: it always fires regardless of the input, and so it fails to fit functions like $y=mx+b$. It is impossible to disable the output of a specific neuron below a certain threshold value of the input. Even though adding more layers and neurons greatly eases and hides this issue, neural networks without biases are likely to perform worse than those with biases (assuming the total number of layers/neurons is the same).
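A quick numeric illustration of that last point (my own toy example, with a made-up target $y=2x+3$): a bias-free neuron always outputs $0$ at $x=0$, so no choice of $w$ alone can match the intercept, while adding a bias term fits it exactly.

```python
import numpy as np

# Made-up target: y = 2x + 3  (m = 2, b = 3)
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2 * x + 3

# Best least-squares fit of f(x) = w * x, i.e. a neuron with no bias
w = np.sum(x * y) / np.sum(x * x)
print(w * 0.0)              # always 0.0, but the target at x = 0 is 3
print(np.abs(w * x - y))    # non-zero error at every point

# With a bias term, f(x) = w * x + b recovers the target exactly
w_b, b = np.polyfit(x, y, 1)
print(w_b, b)               # ~2.0, ~3.0
```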

In conclusion, biases are supplements to the weights that help a network better fit the pattern; they are not strictly necessary, but they help the network perform better.

## Another way of writing the Forward Propagation

Interestingly, the forward propagation algorithm $$\vec{y_{n}}=\mathbf{W_n}^T \times \vec{y_{n-1}} + 1 \times \vec{b_n}$$ can also be written like this: $$\vec{y_{n}}= \left[ \begin{array}{c} \mathbf{W_n} \\ \vec{b_n}^T \end{array} \right]^T \times \left[ \begin{array}{c} \vec{y_{n-1}} \\ 1 \end{array} \right]$$ which is $$\vec{y_{n}} = \mathbf{W_{new_n}}^T \times \vec{y_{new_{n-1}}}$$ where $\vec{y_{new_{n-1}}}$ is $\vec{y_{n-1}}$ with a constant $1$ appended, and $\mathbf{W_{new_n}}$ is $\mathbf{W_n}$ with $\vec{b_n}^T$ appended as an extra row. This way of rewriting the equation makes the adjustment by gradient really easy to write.
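As a sanity check, here is a small numpy sketch (with made-up sizes) showing that appending a constant $1$ to the input and stacking the biases onto the weight matrix reproduces the original result:

```python
import numpy as np

l_prev, l_n = 3, 2
y_prev = np.random.rand(l_prev, 1)
W_n = np.random.rand(l_prev, l_n)
b_n = np.random.rand(l_n, 1)

# Original form: y_n = W_n^T . y_{n-1} + b_n
y_original = W_n.T @ y_prev + b_n

# Rewritten form: append a constant 1 to the input, and append b_n^T
# as an extra row of the weight matrix.
y_new = np.vstack([y_prev, [[1.0]]])   # shape (l_prev + 1, 1)
W_new = np.vstack([W_n, b_n.T])        # shape (l_prev + 1, l_n)

y_rewritten = W_new.T @ y_new
print(np.allclose(y_original, y_rewritten))   # True
```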

## How to update them?

It's super easy after the rewrite: $$\mathbf{W_{new}} =\mathbf{W_{new}}-\eta\frac{\partial Error}{\partial \mathbf{W_{new}}}$$ where $\eta$ is the learning rate.
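For concreteness, a single update step might look like the sketch below; the gradient `dE_dW` is assumed to come from backpropagation, and the learning rate of 0.01 is an arbitrary choice:

```python
import numpy as np

learning_rate = 0.01     # hypothetical step size

# Augmented weight matrix (weights with the biases stacked on), made-up size.
W_aug = np.random.rand(4, 2)

# dE_dW: partial derivative of the error with respect to W_aug, assumed
# to be supplied by backpropagation (it has the same shape as W_aug).
dE_dW = np.random.rand(4, 2)

# Gradient-descent update: step against the gradient of the error.
W_aug = W_aug - learning_rate * dE_dW
```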

## The Activation Function

There is one more component yet to be mentioned--the Activation Function. It's basically a function that takes the output of a neuron as input and produces whatever value is defined as the final output of the neuron: $$\vec{y_{n}} =Activation(\mathbf{W_{new_n}}^T \times \vec{y_{new_{n-1}}})$$ There are copious types of them around, but all of them share at least one property: they are all non-linear!

That's basically what they are designed for. Activation functions project the output through a non-linear function, thus introducing non-linearity into the model.

Consider non-linearly-separable problems like the XOR problem: giving the network the ability to draw non-linear separators may help the classification.
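For example, a tiny 2-2-1 network with a (non-linear) step activation and hand-picked weights can compute XOR, something no single linear neuron can do; the weights below are a standard textbook choice, not anything specific to this article:

```python
import numpy as np

def step(z):
    # Non-linear step activation: fires only above the threshold
    return (z > 0).astype(float)

# Hand-picked weights/biases: the hidden layer computes OR and AND,
# and the output layer combines them into XOR.
W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])
b1 = np.array([-0.5, -1.5])
W2 = np.array([1.0, -1.0])
b2 = -0.5

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    h = step(np.array(x) @ W1 + b1)   # hidden layer: [OR, AND]
    y = step(h @ W2 + b2)             # output layer: XOR
    print(x, int(y))                  # -> 0, 1, 1, 0
```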

Also, there's another purpose of the activation function, which is to project a potentially huge input into a bounded range such as $(-1, 1)$, making the follow-up calculations easier and faster.
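A sketch of two common choices, tanh and the logistic sigmoid (the specific functions are my pick, not prescribed above): both are non-linear, and tanh squashes any real input into $(-1, 1)$.

```python
import numpy as np

def sigmoid(z):
    # Logistic sigmoid: squashes any real input into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-20.0, -2.0, 0.0, 2.0, 20.0])   # raw (pre-activation) outputs

print(np.tanh(z))    # squashed into (-1, 1): roughly [-1, -0.96, 0, 0.96, 1]
print(sigmoid(z))    # squashed into (0, 1):  roughly [ 0,  0.12, 0.5, 0.88, 1]
```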

2017/10/15