One of the most striking facts about neural networks is that they are guaranteed to be able to closely approximate any function $f(x)$.

❗️ Note: this post is not interactive, but you can give the accompanying Jupyter notebook a spin here.

Importantly, the above holds even in the case of many inputs and outputs, $f(x_1, \dots, x_m)$, as in multi-label problems, for example.

This observation is in line with the universality theorem about neural networks.

Universality theorems are a commonplace in computer science, so much so that we sometimes forget how astonishing they are. But it’s worth reminding ourselves: the ability to compute an arbitrary function is truly remarkable. Almost any process you can imagine can be thought of as function computation. It is important to note that

  • The accuracy $\epsilon$ of the approximation of a function $f(x)$ depends on the number of hidden neurons in the network: a neural network with a single hidden layer can approximate any continuous function to any desired precision, given enough hidden neurons (see the sketch right after this list).
  • Only continuous functions are covered: in general, discontinuous functions cannot be approximated by a neural network. In practice, this is usually not an important limitation.
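
To make the first point concrete, here is a minimal, self-contained sketch (not part of the original notebook) of the idea developed in the rest of this post: a single hidden layer of very steep sigmoid neurons is assembled by hand into a “staircase” that tracks a continuous target function, and the worst-case error shrinks as neurons are added. The target $\sin(2\pi x)$, the helper names, and parameter values such as w_step are illustrative choices, and the output here is a plain weighted sum of the hidden activations, which is the form in which the theorem is usually stated.

import numpy as np

def sigmoid(z):
    # clipping only avoids overflow warnings for very steep steps
    return 1.0/(1.0 + np.exp(-np.clip(z, -60, 60)))

def staircase_approx(f, x, n_hidden, w_step=1e4):
    # hidden neuron i has weight w_step and bias -w_step*s_i, so it "switches on"
    # at s_i = i/n_hidden; its output weight is the increment of f across that bin
    s = np.arange(n_hidden)/n_hidden                 # step positions in [0, 1)
    edges = f(np.append(s, 1.0))                     # f at the bin edges
    v = np.diff(edges)                               # height of each step
    a = sigmoid(w_step*(x[:, None] - s[None, :]))    # hidden activations
    return f(0.0) + a @ v                            # weighted sum of step neurons

x = np.linspace(0.0, 1.0, 1001)
f = lambda t: np.sin(2*np.pi*t)
for n in (5, 20, 100):
    err = np.max(np.abs(staircase_approx(f, x, n) - f(x)))
    print(f"{n:4d} hidden neurons -> max error {err:.3f}")

More neurons give a smaller maximum error, which is exactly the trade-off described by the two points above.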

Forward Pass

Activation function: the sigmoid

In neural networks, the activation function of a neuron defines the output of that neuron given an input or set of inputs.

A sigmoid function (1) maps any value to a value between 0 and 1. We use it to convert numbers to probabilities. It also has several other properties that are desirable for training neural networks. One of them is that the sigmoid is differentiable, and its output can be used to calculate its own derivative. This is very useful in backpropagation, the process through which the neural network learns from its own mistakes.

\[\sigma (z) = \frac{1}{1+e^{-z}} \ (1)\]
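
This is possible because the sigmoid's derivative can be written purely in terms of its own output, via the standard identity

\[\sigma'(z) = \sigma(z)\,\left(1-\sigma(z)\right),\]

which is why the deriv branch of the sigmoid helper defined below simply computes z*(1-z) on a value that is already a sigmoid output.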

The hidden neuron

Hidden neurons

  • receive an input $wx+b$, where $x$ is the input (training example), $w$ the weight of a connection (synapse), and $b$ a bias term,
  • and output the value of their activation function, e.g. the sigmoid $\sigma(wx+b)$ (a tiny numeric example follows below).
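
As a minimal numeric sketch of a single hidden neuron's forward pass (the input value x=0.8 is an illustrative choice; w=5 and b=-2 are the default slider values of the demo below):

import numpy as np

x, w, b = 0.8, 5, -2          # one input with the demo's default weight and bias
z = w*x + b                   # weighted input: 5*0.8 - 2 = 2.0
a = 1/(1 + np.exp(-z))        # sigmoid activation, roughly 0.88
print(a)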

[Figure: a single hidden neuron $n_1$ receiving the input $wx+b$ and outputting $\sigma(wx+b)$]

To get a feel for how different values of the weight $w$ and bias $b$ affect the output of $n_1$, play around with the sliders in the plot below.

%matplotlib inline
import numpy as np
import ipywidgets as widgets
import matplotlib.pyplot as plt
from IPython.display import display

# sigmoid activation; with deriv=True, z is assumed to already be a sigmoid
# output, so z*(1-z) is the sigmoid's derivative evaluated at that point
def sigmoid(z, deriv=False):
    if deriv:
        return z*(1-z)
    return 1/(1+np.exp(-z))
 

def sigmoid_demo(w=5,b=-2):
    x = np.linspace(-1,1,101)
    z = w*x + b
    s = sigmoid(z, deriv=False)

    fig = plt.figure(figsize=(6,16))
    ax1 = fig.add_subplot(3, 1, 1)
    # ax1.set_xticks([])
    # ax1.set_yticks([])
    plt.plot(x,s,lw=2,color='red')
    plt.xlim(0, 1)
    # plt.ylim(0, 1)
    plt.title("Output of $n_1$ hidden neuron", fontsize=16)
    plt.xlabel("$x$", fontsize=14)
    plt.ylabel("$\sigma(x)$", fontsize=14)

w_slider = widgets.IntSlider(description='weight:', min=-20, max=500, step=1, value=5)
b_slider = widgets.IntSlider(description='bias:', min=-500, max=500, step=1, value=-2)
sigmoid_widget=widgets.interactive(sigmoid_demo,w=w_slider,b=b_slider)
display(sigmoid_widget)

By changing the values of the

  • bias, you can see how the graph shifts along the $x$-axis, that is right or left, without changing its shape,
  • weight, you can see how the steepness of the neuron’s output changes.

Step function

Now try to set w=500. As you do, the curve gets steeper and steeper, until eventually it begins to look like a step function. Then adjust the bias to bias=-250 so that the step occurs near $x=0.5$. If this set of weights and biases were used by the network, it would create a “barrier” at a position

$s=-b/w_{step}$,

where $s$ is the point at which the step occurs and $w_{step} \gg 0$ ensures the sigmoid is steep enough to approximate a step function.
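
With the slider values suggested above ($w_{step}=500$, $b=-250$), this works out to

\[s = -\frac{b}{w_{step}} = -\frac{-250}{500} = 0.5,\]

which is exactly where the step appears in the plot.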

In a scenario like that, the neuron effectively implements a binary decision boundary where

  • $a_1 \approx 0$ if $x<s$
  • $a_1 \approx 1$ if $x>s$
w_slider = widgets.IntSlider(description='weight:', min=-20, max=500, step=1, value=500)
b_slider = widgets.IntSlider(description='bias:', min=-500, max=500, step=1, value=-250)
sigmoid_widget=widgets.interactive(sigmoid_demo,w=w_slider,b=b_slider)
display(sigmoid_widget)

Layer 2: Combining activation functions

Let’s assume that we pass $n_2$ a similarly large $w_{step}$ and calculate the output of $n_3$, which will be

$output_{n_3} = \sigma(w_3 a_1 + w_4 a_2 + b)$
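
Because $a_1$ and $a_2$ behave (approximately) like step functions, the weighted input to $n_3$ is itself approximately piecewise constant. Writing $s_1$ for the point where $a_1$ switches on and $s_2$ for the point where $a_2$ switches off, and assuming $s_1 < s_2$ (as in the demo below, where $a_2$ gets the weight $-w_{step}$),

\[
w_3 a_1 + w_4 a_2 + b \approx
\begin{cases}
w_4 + b, & x < s_1 \\
w_3 + w_4 + b, & s_1 < x < s_2 \\
w_3 + b, & x > s_2
\end{cases}
\]

so the step positions are set by the hidden-neuron biases, and the step heights by the weights $w_3$ and $w_4$.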

[Figure: hidden neurons $n_1$ and $n_2$ feeding the second-layer neuron $n_3$]

Let’s see below how that looks:

def nn_demo(w3=10, w4=10, b1=-2, b2=-2):
    x = np.linspace(-1,1,101)
    w_step = 1000
    a1 = sigmoid(w_step*x + b1)
    a2 = sigmoid(-w_step*x + b1) # note the -w_step: a2 is a mirrored, downward step - feel free to experiment with other values
    z = (w3*a1 + w4*a2 + b2)
    s = sigmoid(z)

    fig = plt.figure(figsize=(6,16))
    ax1 = fig.add_subplot(3, 1, 1)
    # ax1.set_xticks([])
    # ax1.set_yticks([])
    plt.plot(x,s,lw=2,color='red')
    # plt.xlim(0, 1)
    # plt.ylim(0, 1)
    plt.title("Output of $n_3$ hidden neuron", fontsize=16)
    plt.xlabel("$x$", fontsize=14)
    plt.ylabel("$\sigma(x)$", fontsize=14)

w3_slider = widgets.FloatSlider(description='weight3:', min=-1, max=1, step=0.1, value=0.5)
w4_slider = widgets.FloatSlider(description='weight4:', min=-1, max=1, step=0.1, value=-0.5)
b1_slider = widgets.IntSlider(description='bias1:', min=-1000, max=1000, step=1, value=400)
b2_slider = widgets.IntSlider(description='bias2:', min=-1000, max=1000, step=1, value=-400)
output_widget=widgets.interactive(nn_demo, w3=w3_slider, w4=w4_slider, b1=b1_slider, b2=b2_slider)
display(output_widget)

In the above we can observe how adjusting the

  • biases determines the position of the steps in the output function $a_3$,
  • weights determines the height of the steps in the output function $a_3$.

It follows that the number of steps in $a_3$ increases with the number of neurons in the previous hidden layer. In essence, the number of neurons in the layer determines the precision of our function approximator, in line with the universal function approximation theorem.

More neurons, more “steps”: stairway to fine-grained resolution

Let’s add another two neurons to the existing hidden layer ($a_4$ and $a_5$ in the figure below), so that we can appreciate the effect of the number of neurons on the output of the neural network.

[Figure: the hidden layer extended with two more neurons, $a_4$ and $a_5$]

def biggernn_demo(w3=10, w4=10, w5=10, w6=10, b1=-2, b2=-2):
    x = np.linspace(-1,1,101)
    w_step = 1000
    a1 = sigmoid(2*w_step*x + b1)
    a2 = sigmoid(-2*w_step*x + b1)
    a3 = sigmoid(w_step*x + b1)
    a4 = sigmoid(-w_step*x + b1)
    z = (w3*a1 + w4*a2 + w5*a3 + w6*a4 + b2)
    s = sigmoid(z)

    fig = plt.figure(figsize=(6,16))
    ax1 = fig.add_subplot(3, 1, 1)
    # ax1.set_xticks([])
    # ax1.set_yticks([])
    plt.plot(x,s,lw=2,color='red')
    # plt.xlim(0, 1)
    # plt.ylim(0, 1)
    plt.title("Output of $n_3$ hidden neuron", fontsize=16)
    plt.xlabel("$x$", fontsize=14)
    plt.ylabel("$\sigma(x)$", fontsize=14)

w3_slider = widgets.FloatSlider(description='weight3:', min=-1, max=1, step=0.1, value=0.7)
w4_slider = widgets.FloatSlider(description='weight4:', min=-1, max=1, step=0.1, value=0.3)
w5_slider = widgets.FloatSlider(description='weight5:', min=-1, max=1, step=0.1, value=-0.4)
w6_slider = widgets.FloatSlider(description='weight6:', min=-1, max=1, step=0.1, value=0.1)
b1_slider = widgets.IntSlider(description='bias1:', min=-1000, max=1000, step=1, value=220)
b2_slider = widgets.IntSlider(description='bias2:', min=-1000, max=1000, step=1, value=-67)

output_widget=widgets.interactive(biggernn_demo, w3=w3_slider, w4=w4_slider, w5=w5_slider, w6=w6_slider, b1=b1_slider, b2=b2_slider)
display(output_widget)

We can see that the number of neurons is positively correlated with the number of steps, or changes of level, in the output function, i.e. $n_{neurons} \propto n_{critical\ points}$. Using the default parameter values above, we can see how our neural network approximates a QRS wave.

It follows that, as we increase the number of neurons in a single-hidden-layer neural network, we increase the precision with which the model can approximate a function: the more neurons, the smaller the achievable approximation error $\epsilon$.

In essence, this is what the “Universality Theorem” for neural networks describes; you can find out more about it, with examples, in a follow-up notebook.

References