ML 10 DL

Key concepts:

The activation function is a hyperparameter; the weights & biases are parameters (learned during training).

Key terms:

TLUs (threshold logic units): calculate a weighted sum of inputs ⟶ apply a threshold to produce a binary output
FNN (feedforward neural network): an architecture in which the signal flows in only one direction, from the inputs to the outputs
DNN (deep neural network): an ANN that contains a deep stack of hidden layers

Steps:

TLU computes the weighted sum of its inputs (input neuron values & their input weights). (Becomes the x-axis value.)
\(z = w_1 x_1 + w_2 x_2 + \cdots + w_n x_n = x^T w\)
TLU applies a step function to this sum. (Becomes the y-axis value.)
\(h_w(x) = \text{step}(z)\), where \(z = x^T w\)
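
A minimal sketch of the two steps above in plain Python. The Heaviside convention (output 1 when z >= 0) and the example weights are assumptions chosen for illustration, not values from the notes.

```python
def step(z):
    """Heaviside step function (assumed convention: 1 if z >= 0, else 0)."""
    return 1 if z >= 0 else 0


def tlu(x, w):
    """Threshold logic unit: weighted sum of the inputs, then a step function."""
    z = sum(wi * xi for wi, xi in zip(w, x))  # z = x^T w
    return step(z)


# Hypothetical weights and inputs, for illustration only.
weights = [0.5, -1.0, 2.0]
out_a = tlu([1.0, 1.0, 1.0], weights)  # z = 0.5 - 1.0 + 2.0 = 1.5
out_b = tlu([1.0, 2.0, 0.0], weights)  # z = 0.5 - 2.0 + 0.0 = -1.5
print(out_a, out_b)
```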

Sam

An artificial neuron is a simple model of a biological neuron; an ANN is a network of them.

An artificial neuron contains:

1+ input neurons
1 output neuron
Connections between these. If at least a threshold number of input connections are active, the output neuron (ON) is activated.

We can build a network of artificial neurons that computes any logical proposition you want.
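
As a rough sketch of that claim, the gates below are built from one kind of binary neuron. The inhibitory input (any active inhibitory input silences the neuron) is an assumption added so that NOT can be expressed; the function names are hypothetical.

```python
def neuron(excitatory, inhibitory=(), threshold=2):
    """Binary neuron: outputs 1 when at least `threshold` excitatory inputs
    are active and no inhibitory input is active (assumed inhibition rule)."""
    if any(inhibitory):
        return 0
    return 1 if sum(excitatory) >= threshold else 0


def AND(a, b):
    return neuron([a, b], threshold=2)


def OR(a, b):
    return neuron([a, b], threshold=1)


def NOT(a):
    # An always-on input that is silenced whenever `a` is active.
    return neuron([1], inhibitory=[a], threshold=1)


def XOR(a, b):
    # A composite proposition built by wiring the simpler gates together.
    return AND(OR(a, b), NOT(AND(a, b)))


for a in (0, 1):
    for b in (0, 1):
        print(a, b, AND(a, b), OR(a, b), XOR(a, b))
```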

An MLP is composed of: an input layer, one or more hidden layers of TLUs, and an output layer of TLUs.

Notes

Every layer except the output layer includes a bias neuron and is fully connected to the next layer.

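Outputs of a fully connected layer = \(h_{W,b}(X) = \Theta(XW + b)\)

X = our dataset (matrix of input features: one row per instance, one column per feature)

W = weight matrix (all connection weights except the bias neuron's: one row per input neuron, one column per artificial neuron in the layer)

b = bias vector (all connection weights between the bias neuron & the artificial neurons: one bias term per artificial neuron)

\(\Theta\) = activation function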
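
A small sketch of the layer equation using plain Python lists. ReLU stands in for \(\Theta\) and the numbers are made up; both are assumptions for illustration.

```python
def relu(z):
    return max(0.0, z)


def dense_layer(X, W, b, activation=relu):
    """Compute activation(XW + b), one instance (row of X) at a time.

    X: list of instances, each a list of n_inputs features
    W: n_inputs x n_neurons weight matrix, as a list of rows
    b: list of n_neurons bias terms
    """
    outputs = []
    for x in X:
        row = []
        for j in range(len(b)):
            z = sum(x[i] * W[i][j] for i in range(len(x))) + b[j]
            row.append(activation(z))
        outputs.append(row)
    return outputs


# Hypothetical data: 2 instances, 2 features, 3 neurons in the layer.
X = [[1.0, 2.0], [0.0, -1.0]]
W = [[1.0, -1.0, 0.5], [0.5, 1.0, -1.0]]
b = [0.0, 1.0, 0.5]
H = dense_layer(X, W, b)
print(H)
```

The output has one row per instance and one column per neuron, matching the shapes of X and W.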

Pg 290: Backpropagation is Gradient Descent that uses an efficient technique for computing the gradients automatically.

Forward | Make prediction, measure total error
Backward | (in reverse) Go through each layer to measure each connection's error contribution
Gradient descent | Tweak connection weights

Backpropagation computes the gradients of the cost function with respect to every model parameter using reverse-mode autodiff.

(Forward) Feed the inputs into the network
For each layer, the output is computed from its connections (weights & biases) and passed on to the next layer. Note that the activation function must be nonlinear and differentiable, with a nonzero gradient, so that the chain rule gives Gradient Descent something to work with.
Find the total network error
(Backward) Use the chain rule to find how much each connection contributed to the total error, working from the final layer back to the initial layer
(Gradient descent) Adjust the connection weights to reduce the error
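
The steps above can be sketched by hand for a tiny network: one input, one sigmoid hidden neuron, one linear output. The sigmoid activation, the squared-error loss and all numbers are assumptions for illustration; real frameworks derive the backward pass automatically with reverse-mode autodiff.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(params, x):
    """Forward pass: returns the prediction and the hidden activation."""
    w1, b1, w2, b2 = params
    a1 = sigmoid(w1 * x + b1)   # hidden layer
    y_hat = w2 * a1 + b2        # linear output layer
    return y_hat, a1


def loss(params, x, y):
    y_hat, _ = forward(params, x)
    return 0.5 * (y_hat - y) ** 2


def backward(params, x, y):
    """Backward pass: chain rule from the output layer back to the first layer."""
    w1, b1, w2, b2 = params
    y_hat, a1 = forward(params, x)
    d_yhat = y_hat - y                # dL/dy_hat
    d_w2 = d_yhat * a1                # output-layer weight
    d_b2 = d_yhat                     # output-layer bias
    d_a1 = d_yhat * w2                # error flowing back into the hidden neuron
    d_z1 = d_a1 * a1 * (1.0 - a1)     # sigmoid'(z1) = a1 * (1 - a1)
    d_w1 = d_z1 * x                   # hidden-layer weight
    d_b1 = d_z1                       # hidden-layer bias
    return [d_w1, d_b1, d_w2, d_b2]


# Hypothetical starting parameters and a single training instance.
params = [0.5, -0.3, 0.8, 0.1]
x, y = 1.5, 1.0
learning_rate = 0.1
losses = []
for _ in range(50):
    losses.append(loss(params, x, y))
    grads = backward(params, x, y)
    # Gradient descent step: move every parameter against its gradient.
    params = [p - learning_rate * g for p, g in zip(params, grads)]
print(losses[0], losses[-1])
```

Comparing `backward` against finite differences of `loss` is a quick way to check that the chain-rule terms are right.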

# hidden layers: Start with 1 or 2 hidden layers. Early layers find simple patterns; later layers combine them into complex ones. Add layers until the model starts overfitting.
# neurons per hidden layer: Typically use the same number for each layer (e.g. 100), but could try adding more neurons to early layers if needed.
Learning rate: Train the model for a few hundred iterations (e.g. 300), starting with a low learning rate (\(10^{-5}\)) and gradually increasing it to 10. Pick a rate a bit lower than the point where the loss starts to climb.
Optimizer: Ch 11
Batch size: 32
Activation function: ReLU for hidden layers, output layer depends on task
# of iterations: Don't worry about it, use Early Stopping instead
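
Two of the guidelines above lend themselves to a quick sketch: the learning-rate sweep and early stopping. Growing the rate by a constant factor per iteration is one common way to "gradually increase" it and is an assumption here, as are the function names and the `patience` parameter.

```python
def lr_sweep(lr_min=1e-5, lr_max=10.0, n_iterations=300):
    """Learning rates from lr_min to lr_max, multiplied by a constant factor each iteration."""
    factor = (lr_max / lr_min) ** (1.0 / (n_iterations - 1))
    return [lr_min * factor ** i for i in range(n_iterations)]


def early_stopping_epoch(val_errors, patience=10):
    """Return (stop_epoch, best_epoch): training stops once the validation
    error has not improved for `patience` consecutive epochs."""
    best_error = float("inf")
    best_epoch = 0
    for epoch, err in enumerate(val_errors):
        if err < best_error:
            best_error, best_epoch = err, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    return len(val_errors) - 1, best_epoch


rates = lr_sweep()
print(rates[0], rates[-1])

# Hypothetical validation errors, one per epoch.
print(early_stopping_epoch([1.0, 0.8, 0.7, 0.75, 0.72, 0.71, 0.9], patience=3))
```

In practice the model weights from `best_epoch` are the ones to keep, not the weights at the epoch where training stopped.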

Overfitting / Underfitting / Scaling input data

Regularization: add a penalty to the loss function (it grows when the weights get too large)
- L1 (absolute/lasso): penalizes the absolute values of the weights ⟶ tends to set the weights of irrelevant features to 0
- L2 (squares/ridge): penalizes the squared weights ⟶ shrinks the weights of irrelevant features smoothly toward 0
Early stopping: stop training once the validation error stops improving
Dropout:
- At each training step (mini-batch), every neuron in a layer is temporarily dropped with some probability, the dropout rate. (To compensate, multiply the weights by 1 - dropout rate after training, or equivalently divide the surviving outputs by 1 - dropout rate during training.)
- During training we drop out some neurons; during testing we bring them all back but discount their weights
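
A minimal sketch of dropout on one layer's outputs. It uses the "inverted" variant, which divides the surviving activations by the keep probability during training instead of discounting the weights at test time; the two are equivalent in expectation. The function name and values are hypothetical.

```python
import random


def dropout(activations, rate, training, rng=random):
    """Inverted dropout: in training, zero each activation with probability
    `rate` and scale the survivors by 1 / (1 - rate); at test time, pass
    the activations through unchanged."""
    if not training or rate == 0.0:
        return list(activations)
    keep_prob = 1.0 - rate
    return [a / keep_prob if rng.random() < keep_prob else 0.0 for a in activations]


rng = random.Random(0)  # fixed seed so the run is repeatable
layer_output = [1.0] * 10000
dropped = dropout(layer_output, rate=0.2, training=True, rng=rng)
mean_train = sum(dropped) / len(dropped)  # stays close to the original mean of 1.0
test_out = dropout([1.0, 2.0], rate=0.2, training=False)
print(mean_train, test_out)
```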

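Vanishing gradient: gradients get smaller and smaller as backpropagation moves from the last layers back to the first, so the first layers' weights barely change. One cause: with saturating activations, the variance of each layer's outputs keeps growing, so the later layers saturate and pass back almost no gradient.
- ReLU: does not saturate for positive inputs (slope of 1), so the gradient passes through active neurons undiminished from one layer to the next
- Problem with ReLU: for negative inputs the slope is 0, so a neuron whose weighted sum is negative for every training instance outputs only 0 and stops learning (it "dies")
- Alternative: Leaky ReLU uses a very small slope instead of 0 for negative inputs
- Alternative: Parametric ReLU learns the slope of the "below 0" section during training
- Batch normalization: normalizes each layer's inputs over the mini-batch (then rescales and shifts them), so the changing distribution of the previous layer's outputs doesn't destabilize training
- Gradient clipping (for exploding gradients): set a threshold that gradient components can't go above/below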
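
A short sketch of the ReLU variants and of clipping by value. The default slope `alpha=0.01` and the threshold of 1.0 are assumed example values; in Parametric ReLU, `alpha` would be a learned parameter instead of a constant.

```python
def relu(z):
    return max(0.0, z)


def relu_grad(z):
    # Slope is 0 for negative inputs: this is what lets a neuron "die".
    return 1.0 if z > 0 else 0.0


def leaky_relu(z, alpha=0.01):
    return z if z > 0 else alpha * z


def leaky_relu_grad(z, alpha=0.01):
    # A small nonzero slope keeps some gradient flowing for negative inputs.
    return 1.0 if z > 0 else alpha


def clip_by_value(grads, threshold=1.0):
    """Clamp every gradient component to the range [-threshold, threshold]."""
    return [max(-threshold, min(threshold, g)) for g in grads]


print(relu_grad(-2.0), leaky_relu_grad(-2.0))
print(clip_by_value([0.9, 100.0, -3.5]))
```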