ML 10 DL

Forward Pass

Key concepts:

  • The activation function is a hyperparameter, the weights & biases are parameters.

Key terms:

  • TLUs (threshold logic units): calculate a weighted sum of inputs ⟶ apply a threshold to produce a binary output

  • FNN (feedforward neural network): an architecture in which the signal flows in only one direction, from the inputs to the outputs

  • DNN (deep neural network): when an ANN contains a deep stack of hidden layers

Steps:

  1. TLU computes the weighted sum of its inputs (input neuron values times their connection weights). (Becomes the x-axis value.)
    \(z = w_1 x_1 + w_2 x_2 + \cdots + w_n x_n = x^T w\)

  2. TLU applies a step function to this sum. (Becomes the y-axis value.)
    \(h_w(x) = \text{step}(z)\), where \(z = x^T w\)
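
The two steps above can be sketched in plain Python. This is only an illustration: the names `step` and `tlu`, the example weights, and the choice of a Heaviside step with its threshold at 0 are assumptions, not something the notes specify.

```python
def step(z):
    """Heaviside step: 1 if z >= 0, else 0 (threshold at 0 is an assumption)."""
    return 1 if z >= 0 else 0


def tlu(x, w):
    """Threshold logic unit: weighted sum of the inputs, then a step function."""
    # Step 1: z = w_1*x_1 + ... + w_n*x_n
    z = sum(w_i * x_i for w_i, x_i in zip(w, x))
    # Step 2: h_w(x) = step(z)
    return step(z)


# Illustrative weights and inputs only
weights = [0.5, -1.0, 2.0]
print(tlu([1.0, 2.0, 1.0], weights))  # z = 0.5 - 2.0 + 2.0 = 0.5  -> 1
print(tlu([1.0, 3.0, 1.0], weights))  # z = 0.5 - 3.0 + 2.0 = -0.5 -> 0
```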

ANNs

Sam

An artificial neuron is a simple model of a biological neuron.

An artificial neuron contains:

  • 1+ input neurons

  • 1 output neuron

  • Connections between these. If at least a threshold number of input connections are active, the output neuron (ON) is activated.

We can build a network of artificial neurons that computes any logical proposition you want.
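
To illustrate the claim above, here is a sketch of binary neurons that fire when at least a threshold number of their inputs are active. The `neuron` helper and the gate constructions are assumptions made for this example; NOT is modeled with an inhibitory input, which is one common convention.

```python
def neuron(inputs, threshold, inhibitors=()):
    """Binary neuron: outputs 1 when at least `threshold` inputs are active
    and no inhibitory input is active."""
    if any(inhibitors):
        return 0
    return 1 if sum(inputs) >= threshold else 0


def AND(a, b):
    return neuron([a, b], threshold=2)  # both inputs must be active


def OR(a, b):
    return neuron([a, b], threshold=1)  # one active input is enough


def NOT(a):
    # An always-on input excites the neuron; `a` inhibits it
    return neuron([1], threshold=1, inhibitors=[a])


def XOR(a, b):
    # Neurons compose into a network: (a OR b) AND NOT (a AND b)
    return AND(OR(a, b), NOT(AND(a, b)))


for a in (0, 1):
    for b in (0, 1):
        print(a, b, "AND:", AND(a, b), "OR:", OR(a, b), "XOR:", XOR(a, b))
```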

MLP

An MLP is composed of:

  • 1 input layer (passthrough)

  • 1+ hidden layers of TLUs (threshold logic units)

  • 1 output layer of TLUs

Notes

  • Every layer except the output layer includes a bias neuron and is fully connected to the next layer.

Equation

Outputs of fully connected layer = \(h_{W,b} (X) = \Theta(XW + b)\)

X = our dataset (matrix of input features)

  • 1 row per instance

  • 1 column per feature

W = weight matrix

  • 1 row per input neuron (IN)

  • 1 column per artificial neuron (AN) in the layer

b = bias vector, contains all connection weights between bias neuron & AN

  • 1 bias term per AN

\(\Theta\) = activation function
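
The layer equation can be sketched with plain lists, following the shapes described above (one row of X per instance, one column of W per neuron). Using ReLU for \(\Theta\) and the example numbers are assumptions for illustration; in practice a matrix library would do the multiplication.

```python
def relu(z):
    return max(0.0, z)


def dense_layer(X, W, b, activation=relu):
    """Compute activation(XW + b) for a batch.

    X: list of instances, each a list of n_inputs features
    W: n_inputs rows, each a list of n_neurons weights
    b: list of n_neurons bias terms
    """
    outputs = []
    for x in X:                      # one row per instance
        row = []
        for j in range(len(b)):      # one column per neuron
            z = sum(x[i] * W[i][j] for i in range(len(x))) + b[j]
            row.append(activation(z))
        outputs.append(row)
    return outputs


# 2 instances, 2 features, 3 neurons (made-up values)
X = [[1.0, 2.0],
     [0.0, -1.0]]
W = [[1.0, -1.0, 0.5],
     [0.5, 1.0, -1.0]]
b = [0.0, 1.0, 0.0]

H = dense_layer(X, W, b)
print(H)  # 2 rows (instances) x 3 columns (neurons)
```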

Pg 290: Backpropagation is Gradient Descent but using an efficient technique for computing the gradients automatically.

  • Forward | Make prediction, measure total error

  • Backward | (in reverse) Go through each layer to measure each connection's error contribution

  • Gradient descent | Tweak connection weights

Backpropagation computes the gradient of the cost function with respect to every model parameter, using reverse-mode autodiff:

  1. (Forward) Feed a mini-batch of inputs into the network.

  2. For each layer, compute the output from the connection weights and biases. Note that the activation function must be nonlinear and differentiable with a usable (non-zero) gradient, unlike the step function, so that the chain rule can be applied.

  3. Find the total network error.

  4. (Backward) Use the chain rule to find how much each connection contributed to the total error, working from the final layer back to the initial layer.

  5. (Gradient descent) Adjust the connection weights
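
The five steps can be traced by hand on a very small network. The sketch below assumes one input, one sigmoid hidden neuron, one linear output, a squared-error loss, and a single made-up training example. It writes out the chain rule manually, whereas a real framework would get the same gradients from autodiff.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(params, x):
    """Steps 1-2: feed x through the network layer by layer."""
    w1, b1, w2, b2 = params
    h = sigmoid(w1 * x + b1)   # hidden neuron (nonlinear, differentiable)
    y_hat = w2 * h + b2        # linear output neuron
    return h, y_hat


def loss_and_grads(params, x, y):
    w1, b1, w2, b2 = params
    h, y_hat = forward(params, x)
    loss = 0.5 * (y_hat - y) ** 2          # step 3: network error
    # Step 4: chain rule, from the output layer back to the first layer
    d_yhat = y_hat - y                     # dLoss/dy_hat
    d_w2, d_b2 = d_yhat * h, d_yhat
    d_h = d_yhat * w2                      # error reaching the hidden neuron
    d_z = d_h * h * (1.0 - h)              # sigmoid'(z) = h * (1 - h)
    d_w1, d_b1 = d_z * x, d_z
    return loss, [d_w1, d_b1, d_w2, d_b2]


def train(params, x, y, learning_rate=0.5, n_steps=50):
    losses = []
    for _ in range(n_steps):
        loss, grads = loss_and_grads(params, x, y)
        losses.append(loss)
        # Step 5: gradient descent update
        params = [p - learning_rate * g for p, g in zip(params, grads)]
    return params, losses


# Made-up starting weights and training example
params, losses = train([0.1, 0.0, -0.2, 0.0], x=1.5, y=1.0)
print(f"loss: {losses[0]:.4f} -> {losses[-1]:.6f}")
```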

Hyperparameters

Pg 323 | Paper by Leslie Smith

  • # hidden layers: Start with 1 or 2 hidden layers. Early layers find simple patterns, later layers find complex. Add until we start overfitting.

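  • # neurons per hidden layer: Typically use the same number for each layer (e.g. 100), but could try adding more neurons to early layers if needed.

  • Learning rate: Train the model for a few hundred iterations (e.g. 300), starting with a very low learning rate (\(10^{-5}\)) and gradually increasing it up to a large value (10). Plot the loss against the learning rate and pick a rate somewhat lower than the point where the loss starts to climb.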

  • Optimizer: Ch 11

  • Batch size: 32

  • Activation function: ReLU for hidden layers, output layer depends on task

  • # of iterations: Don't worry about it, use Early Stopping instead
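
Two of the items above lend themselves to a short sketch: the learning rate sweep and early stopping. The function names, the constant multiplicative growth factor, the `patience` parameter, and the validation errors are all assumptions for illustration.

```python
def learning_rate_sweep(lr_min=1e-5, lr_max=10.0, n_iterations=300):
    """Learning rates that grow by a constant factor from lr_min to lr_max."""
    factor = (lr_max / lr_min) ** (1.0 / (n_iterations - 1))
    return [lr_min * factor ** i for i in range(n_iterations)]


def early_stopping_epoch(val_errors, patience=3):
    """Return the index of the best epoch, stopping once the validation
    error has not improved for `patience` consecutive epochs."""
    best_epoch, best_error = 0, float("inf")
    for epoch, error in enumerate(val_errors):
        if error < best_error:
            best_epoch, best_error = epoch, error
        elif epoch - best_epoch >= patience:
            break  # stop training; roll back to the best epoch's weights
    return best_epoch


rates = learning_rate_sweep()
print(rates[0], rates[-1])  # starts at 1e-5, ends at (approximately) 10

# Made-up validation errors, one per epoch
val_errors = [0.90, 0.70, 0.55, 0.50, 0.52, 0.51, 0.53, 0.40]
print(early_stopping_epoch(val_errors))  # 3: training stops before the last epoch is reached
```

With a sweep like this, the loss would be recorded at each iteration and plotted against the learning rate.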

Tips for training NN

  • Regularization: add a penalty term to the loss function that discourages large weights

    • L1 (absolute values/lasso): tends to drive the weights of irrelevant features to exactly 0

    • L2 (squares/ridge): shrinks the weights of irrelevant features smoothly toward 0

  • Early stopping: stop training (rather than fixing the number of epochs) once the validation error stops improving

  • Dropout:

    • At each training step (mini-batch), every neuron in a layer has a probability p (the dropout rate) of being temporarily dropped. To compensate, multiply the weights by the keep probability (1 - p) after training.

    • During training we drop out some neurons; during testing we bring them all back but discount their weights

  • Use a different activation function (try maxout)

  • Use a different optimizer (e.g. one with adaptive learning rates)
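
A minimal sketch of the penalty terms and of dropout follows. Note one swapped-in detail: it uses the "inverted" dropout variant, which divides the kept activations by the keep probability during training instead of multiplying the weights by (1 - p) afterwards; the two are equivalent on average. Function names, `alpha`, and the example values are assumptions.

```python
import random


def l1_penalty(weights, alpha):
    """L1 (lasso) term: alpha * sum of absolute weights."""
    return alpha * sum(abs(w) for w in weights)


def l2_penalty(weights, alpha):
    """L2 (ridge) term: alpha * sum of squared weights."""
    return alpha * sum(w * w for w in weights)


def dropout(activations, drop_rate, training, rng=random):
    """Inverted dropout on one layer's activations."""
    if not training or drop_rate == 0.0:
        return list(activations)       # testing: every neuron is kept
    keep_prob = 1.0 - drop_rate
    # Keep each neuron with probability keep_prob and rescale the survivors
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]


weights = [0.5, -2.0, 0.0, 1.5]        # made-up weights
print(l1_penalty(weights, alpha=0.01))
print(l2_penalty(weights, alpha=0.01))

h = [1.0] * 10                         # made-up layer activations
print(dropout(h, drop_rate=0.5, training=True, rng=random.Random(0)))
print(dropout(h, drop_rate=0.5, training=False))
```

The penalty value is added to the training loss; the dropout function is applied to a layer's outputs on every training step.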

Why not just add more layers?

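  • Vanishing gradients: gradients get smaller as backpropagation moves toward the input, so the first layers are updated far less than the later layers. One cause is that the variance of each layer's outputs keeps growing through the network, until saturating activations (e.g. sigmoid) saturate in the later layers.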

    • ReLU: does not saturate for positive inputs (its slope is 1), so the gradient passes through unchanged from one layer to the next

    • Problem with ReLU: for negative inputs the gradient is 0, so a neuron can "die" (it outputs only 0 and stops learning)

    • Alternative - Leaky ReLU - a very small slope for negative inputs instead of 0

    • Alternative - Parametric ReLU - can adjust slope for the "below 0" section

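    • Batch normalization: zero-centers and normalizes each layer's inputs over the mini-batch, then scales and shifts them with learned parameters, so changes in the distribution of the previous layer's outputs during training have less impact

    • Gradient clipping (for exploding gradients): set a threshold that gradients can't go above/below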
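
The points above can be made concrete with a few small functions. This sketch ignores the weights and looks only at the activation derivative, which is a simplification; the function names, `alpha`, and the clipping threshold are assumptions.

```python
import math


def relu(z):
    return max(0.0, z)


def leaky_relu(z, alpha=0.01):
    """Like ReLU, but with a small slope `alpha` for z < 0.
    Parametric ReLU uses the same form with `alpha` learned during training."""
    return z if z > 0 else alpha * z


def sigmoid_grad(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)  # at most 0.25, reached at z = 0


def clip_gradients(grads, threshold=1.0):
    """Clip every gradient component to the range [-threshold, threshold]."""
    return [max(-threshold, min(threshold, g)) for g in grads]


# Backpropagating through n sigmoid layers multiplies the gradient by
# sigmoid'(z) once per layer. Even in the best case (z = 0) that is 0.25:
for n_layers in (1, 5, 10):
    print(n_layers, sigmoid_grad(0.0) ** n_layers)

print(relu(-2.0), leaky_relu(-2.0))        # 0.0 -0.02
print(clip_gradients([0.3, -7.5, 12.0]))   # [0.3, -1.0, 1.0]
```

After 10 sigmoid layers the best-case factor is already below one millionth, whereas the ReLU derivative for positive inputs stays at 1.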