## What are Weights and Biases

Consider the following forward propagation algorithm: $$~$ \vec{y_{n}}=\mathbf{W_n}^T \times \vec{y_{n-1}} + \vec{b_n} $~$$ where $~$n$~$ is the index of the layer, and $~$\vec{y_n}$~$ is the output of the $~$n^{th}$~$ layer, expressed as an $~$l_n \times 1$~$ vector ($~$l_n$~$ being the number of neurons in the $~$n^{th}$~$ layer). $~$\mathbf{W_n}$~$ is an $~$l_{n-1} \times l_{n}$~$ matrix storing the weights of every connection between layers $~$n-1$~$ and $~$n$~$, which is why it needs to be transposed for the product to be well-defined. $~$\vec{b_n}$~$, likewise, holds the biases of the connections between the $~$(n-1)^{th}$~$ and $~$n^{th}$~$ layers, in the shape of $~$l_n\times1$~$.
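A minimal sketch of one forward-propagation step, following the equation above. The layer sizes (3 inputs, 2 outputs) and random values are arbitrary illustration choices, not anything prescribed by the text:

```python
import numpy as np

rng = np.random.default_rng(0)

l_prev, l_n = 3, 2                          # illustrative layer sizes
W_n = rng.standard_normal((l_prev, l_n))    # l_{n-1} x l_n weight matrix
b_n = rng.standard_normal((l_n, 1))         # l_n x 1 bias vector
y_prev = rng.standard_normal((l_prev, 1))   # output of the previous layer

# y_n = W_n^T · y_{n-1} + b_n, exactly as in the equation above
y_n = W_n.T @ y_prev + b_n

print(y_n.shape)  # (2, 1)
```

Note how the transpose makes the shapes line up: $~$(l_n \times l_{n-1}) \cdot (l_{n-1} \times 1) = l_n \times 1$~$.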

As one can see, both weights and biases are simply adjustable and differentiable (thus trainable) parameters that contribute to the final result.

## Why do we need both of them, and why are Biases Optional?

A neural network is essentially an improved version of the perceptron model: the output of each neuron (perceptron) has a linear correlation with its input, rather than being a plain 0/1. (This linear output is further passed through an activation function to make it non-linear, which will be discussed later.)

To create a linear correlation, the easiest way is to scale the input by a certain coefficient $~$w$~$ and output the scaled value: $$~$ f(x)=w\times x $~$$

This model works reasonably well: even a single neuron can perfectly fit a linear function like $~$f(x)=m\times x$~$, and, together with activation functions, certain non-linear relations can be fit by neurons working in layers.

However, this new neuron without a bias lacks a significant ability that even the perceptron has: it responds the same way regardless of where the input sits, so it fails to fit functions like $~$y=mx+b$~$, and it is impossible to shift the threshold at which a specific neuron's output switches on or off. Even though adding more layers and neurons greatly eases and hides this issue, neural networks without biases are likely to perform worse than those with biases (assuming the total numbers of layers and neurons are the same).
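The limitation can be seen directly: a bias-free neuron $~$f(x)=w\times x$~$ always outputs 0 at $~$x=0$~$, so no choice of $~$w$~$ can fit a target like $~$y=2x+1$~$ (which is 1 at $~$x=0$~$). A pure-Python sketch, with the coefficient values chosen arbitrarily for illustration:

```python
# Hypothetical target function the neuron should fit: y = 2x + 1
target = lambda x: 2 * x + 1

def neuron_without_bias(w, x):
    # Output always passes through the origin: f(0) = 0 for every w
    return w * x

def neuron_with_bias(w, b, x):
    # The bias shifts the line, so f(0) = b can match any intercept
    return w * x + b

print(neuron_without_bias(w=2.0, x=0.0))      # 0.0, regardless of w
print(neuron_with_bias(w=2.0, b=1.0, x=0.0))  # 1.0, matches target(0)
```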

In conclusion, biases are a supplement to the weights that help a network better fit the pattern; they are not strictly necessary, but they help the network perform better.

## Another way of writing the Forward Propagation

Interestingly, the forward propagation algorithm $$~$ \vec{y_{n}}=\mathbf{W_n}^T \times \vec{y_{n-1}} + 1 \times \vec{b_n} $~$$ could also be written like this: $$~$ \vec{y_{n}}= \left( \left[ \begin{array}{c} \vec{y_{n-1}} \\ 1 \end{array} \right]^T \times \left[ \begin{array}{c} \mathbf{W_n} \\ \vec{b_n}^T \end{array} \right] \right)^T $~$$, which is $$~$ \vec{y_{n}} = \left( \vec{y_{new_{n-1}}}^T \times \mathbf{W_{new_n}} \right)^T $~$$. The input is augmented with a constant 1, and the bias row is stacked beneath the weight matrix; this way of rewriting the equation makes the adjustment by gradient really easy to write.
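A quick numerical check that the augmented form produces the same output as the standard $~$\mathbf{W_n}^T \times \vec{y_{n-1}} + \vec{b_n}$~$; the sizes and random values are arbitrary illustration choices:

```python
import numpy as np

rng = np.random.default_rng(1)
l_prev, l_n = 3, 2
W_n = rng.standard_normal((l_prev, l_n))
b_n = rng.standard_normal((l_n, 1))
y_prev = rng.standard_normal((l_prev, 1))

# Standard form
y_standard = W_n.T @ y_prev + b_n

# Augmented form: append a constant 1 to the input,
# and stack the bias row beneath the weight matrix
y_aug = np.vstack([y_prev, [[1.0]]])         # (l_prev + 1) x 1
W_aug = np.vstack([W_n, b_n.T])              # (l_prev + 1) x l_n

y_rewritten = (y_aug.T @ W_aug).T            # transpose back to l_n x 1
print(np.allclose(y_standard, y_rewritten))  # True
```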

## How to update them?

It's super easy after the rewrite: $$~$ \mathbf{W_{new}} = \mathbf{W_{new}} - \eta \times \frac{\partial Error}{\partial \mathbf{W_{new}}} $~$$, where $~$\eta$~$ is the learning rate that controls the step size.
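A minimal gradient-descent sketch using the augmented form: a single neuron learns the (hypothetical) target $~$y=2x+1$~$ from samples. The learning rate (0.1), epoch count, and data range are arbitrary illustration values:

```python
import numpy as np

rng = np.random.default_rng(2)
W_aug = rng.standard_normal((2, 1))  # [w; b] stacked, as in the rewrite

xs = np.linspace(-1.0, 1.0, 20)
targets = 2.0 * xs + 1.0             # samples from y = 2x + 1
lr = 0.1                             # learning rate (eta)

for _ in range(500):
    for x, t in zip(xs, targets):
        y_aug = np.array([[x], [1.0]])     # augmented input [x; 1]
        y = (y_aug.T @ W_aug).item()       # forward pass
        error = y - t
        # For squared error E = (y - t)^2 / 2, dE/dW = (y - t) * input,
        # so the update is W <- W - lr * dE/dW
        W_aug -= lr * error * y_aug

print(W_aug.ravel())  # approaches [2.0, 1.0]
```

The learned vector recovers both the slope (weight) and the intercept (bias) in one update rule, which is exactly the convenience the rewrite buys.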

## The Activation Function

There is one more component yet to be mentioned--the Activation Function. It is basically a function that takes the raw output of a neuron as its input and returns whatever value is defined as the final output of that neuron.
$$~$
\vec{y_{n}} = Activation(\mathbf{W_n}^T \times \vec{y_{n-1}} + \vec{b_n})
$~$$
There are copious types of them around, but they all share at least one property: they are all *non-linear*!

That's basically what they are designed for. Activation functions pass each neuron's output through a non-linear function, thus introducing non-linearity into the model.
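Two common activation functions, sketched to show the non-linearity the text describes. For a linear $~$f$~$, $~$f(a)+f(b)=f(a+b)$~$ would always hold; both functions below violate it (the sample points are arbitrary):

```python
import math

def relu(x):
    # Rectified linear unit: zero for negative inputs
    return max(0.0, x)

def tanh(x):
    # Hyperbolic tangent: smooth, bounded in (-1, 1)
    return math.tanh(x)

print(relu(-1.0) + relu(2.0), relu(-1.0 + 2.0))  # 2.0 vs 1.0
print(tanh(1.0) + tanh(1.0), tanh(2.0))          # ~1.523 vs ~0.964
```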

Consider non-linearly-separable problems like the XOR problem: giving the network the ability to draw non-linear separators may help the classification.

Also, there's another purpose of some activation functions: squashing a potentially huge input into a bounded range such as between -1 and 1, thus making the follow-up calculations easier and more numerically stable.

2017/10/15