ML 10 DL
Links
- Code Ageron: Ch 13 CNNs, Ch 14 RNNs
- Code Ng: Multi-Class & Neural Nets, Neural Nets
Key concepts:
- The activation function is a hyperparameter, the weights & biases are parameters.
Key terms:
- TLU (threshold logic unit): calculates a weighted sum of its inputs ⟶ applies a threshold to produce a binary output
- FNN (feedforward neural network): an architecture in which the signal flows in only one direction, from the inputs to the outputs
- DNN (deep neural network): an ANN that contains a deep stack of hidden layers
Steps:
- TLU computes the weighted sum of its inputs (IN & input weight). (Becomes the x-axis value.)
  \(z = w_1 x_1 + w_2 x_2 + \cdots + w_n x_n = x^T w\)
- TLU applies a step function to this sum. (Becomes the y-axis value.)
  \(h_w(x) = \text{step}(z)\), where \(z = x^T w\)
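The two steps can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function name `tlu`, the Heaviside-style step with a threshold of 0, and the example weights are all assumptions made for the example.

```python
import numpy as np

def tlu(x, w, threshold=0.0):
    """Hypothetical single TLU: weighted sum followed by a step function."""
    z = x @ w                       # step 1: weighted sum z = x^T w
    output = int(z >= threshold)    # step 2: step(z), assumed Heaviside at 0
    return z, output

w = np.array([0.5, -1.0, 0.25])     # illustrative weights, not from the notes

z_off, out_off = tlu(np.array([1.0, 2.0, 3.0]), w)   # z = -0.75, output 0
z_on, out_on = tlu(np.array([2.0, 0.5, 1.0]), w)     # z =  0.75, output 1
```

Plotting `output` against `z` for many inputs would give the step-shaped curve the x-axis/y-axis remarks refer to.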
ANNs
Sam
The artificial neuron is a simple model of the biological neuron.
An artificial neuron contains:
- 1+ input neurons
- 1 output neuron
- Connections between these. If at least a threshold number of input connections are active, the output neuron (ON) is activated.
We can build a network of artificial neurons that computes any logical proposition you want.
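As a rough sketch of how such neurons compute logic, the snippet below builds AND, OR, and "A and not B" from a single firing rule. The threshold of 2 active connections and the helper names are assumptions chosen for illustration; an inhibitory connection is modeled simply as a veto.

```python
def fires(active_connections, threshold=2):
    """Output neuron is ON when enough input connections are active (assumed threshold: 2)."""
    return active_connections >= threshold

def logical_and(a, b):
    # One connection from each input: both must be on to reach the threshold.
    return fires(int(a) + int(b))

def logical_or(a, b):
    # Two connections from each input: either input alone reaches the threshold.
    return fires(2 * int(a) + 2 * int(b))

def a_and_not_b(a, b):
    # Two connections from A, plus an inhibitory connection from B modeled as a veto.
    return fires(2 * int(a)) and not b

# Truth table over all input combinations.
table = [(a, b, logical_and(a, b), logical_or(a, b), a_and_not_b(a, b))
         for a in (False, True) for b in (False, True)]
```

Chaining the outputs of such neurons into the inputs of others is what lets a network express more complex propositions.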
MLP
An MLP is composed of:
- 1 input layer (passthrough)
- 1+ hidden layers of TLUs (threshold logic units)
- 1 output layer of TLUs
Notes
- Every layer except the output layer includes a bias neuron and is fully connected to the next layer.
Equation
Sam
Outputs of fully connected layer = \(h_{W,b} (X) = \Theta(XW + b)\)
X = our dataset (matrix of input features)
- 1 row per instance
- 1 column per feature
W = weight matrix
- 1 row per input neuron (IN)
- 1 column per artificial neuron (AN) in the layer
b = bias vector, containing the connection weights between the bias neuron and each AN
- 1 bias term per AN
\(\Theta\) = activation function
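To make the shapes concrete, here is a small NumPy sketch of \(\Theta(XW + b)\). The function name `dense_layer`, the choice of ReLU for \(\Theta\), and the numbers are assumptions for illustration only.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def dense_layer(X, W, b, activation=relu):
    """Hypothetical fully connected layer: activation(XW + b)."""
    return activation(X @ W + b)

# 2 instances (rows) x 3 features (columns)
X = np.array([[1.0,  2.0, 3.0],
              [0.0, -1.0, 0.5]])
# 3 input neurons (rows) x 2 artificial neurons (columns)
W = np.array([[0.5, -1.0],
              [0.0,  1.0],
              [1.0,  0.0]])
# 1 bias term per artificial neuron; broadcast across the rows of XW
b = np.array([0.1, -0.2])

H = dense_layer(X, W, b)   # shape (2, 2): one row per instance, one column per AN
```

The output has one row per instance and one column per artificial neuron, which is why W needs one row per input and one column per neuron.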
Sam
Pg 290: Backpropagation is Gradient Descent but using an efficient technique for computing the gradients automatically.
- Forward | Make prediction, measure total error
- Backward | (in reverse) Go through each layer to measure each connection's error contribution
- Gradient descent | Tweak connection weights
Sam
Backpropagation computes the gradients of the cost function with respect to every model parameter using reverse-mode autodiff
- (Forward) Feed the inputs into the network
- For each layer, the output is computed from its connections (weights & biases). Note that the activation function must be nonlinear and differentiable (not a step function) so that the chain rule gives useful derivatives.
- Find the total network error
- (Backward) Use the chain rule to find how much each connection contributed to the total error, working from the final layer back to the initial layer
- (Gradient descent) Adjust the connection weights
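The forward / backward / gradient descent cycle can be sketched by hand for a tiny network. This is an illustrative sketch under stated assumptions: one sigmoid hidden layer, a linear output, mean squared error, random toy data, and a learning rate of 0.01. The gradients are written out manually with the chain rule rather than produced by an autodiff library.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, params):
    """Forward pass: hidden activations and predictions."""
    W1, b1, W2, b2 = params
    A1 = sigmoid(X @ W1 + b1)      # hidden layer (nonlinear, differentiable)
    Y_hat = A1 @ W2 + b2           # linear output layer
    return A1, Y_hat

def mse(Y_hat, Y):
    return np.mean((Y_hat - Y) ** 2)

def backward(X, Y, params):
    """Backward pass: chain rule from the output layer back to the first layer."""
    W1, b1, W2, b2 = params
    A1, Y_hat = forward(X, params)
    dY = 2.0 * (Y_hat - Y) / Y.size    # d(loss)/d(Y_hat)
    dW2 = A1.T @ dY
    db2 = dY.sum(axis=0)
    dA1 = dY @ W2.T                    # error pushed back to the hidden layer
    dZ1 = dA1 * A1 * (1.0 - A1)        # sigmoid'(z) = a * (1 - a)
    dW1 = X.T @ dZ1
    db1 = dZ1.sum(axis=0)
    return [dW1, db1, dW2, db2]

def gd_step(params, grads, lr=0.01):
    """Gradient descent: tweak every parameter against its gradient."""
    return [p - lr * g for p, g in zip(params, grads)]

rng = np.random.default_rng(0)         # toy data, purely illustrative
X = rng.normal(size=(8, 3))
Y = rng.normal(size=(8, 1))
params = [rng.normal(size=(3, 4)), np.zeros(4),
          rng.normal(size=(4, 1)), np.zeros(1)]

loss_before = mse(forward(X, params)[1], Y)
params_new = gd_step(params, backward(X, Y, params))
loss_after = mse(forward(X, params_new)[1], Y)
```

A common sanity check for hand-written gradients like these is to compare one entry against a finite-difference estimate of the loss.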
Hyperparameters
Pg 323 | Paper by Leslie Smith
Sam
- # hidden layers: Start with 1 or 2 hidden layers. Early layers find simple patterns, later layers find complex ones. Add layers until we start overfitting.
- # neurons per hidden layer: Typically use the same number for each layer (e.g. 100), but could try adding more neurons to early layers if needed.
- Learning rate: Train the model for about 300 iterations, starting with a low learning rate (\(10^{-5}\)) and gradually increasing it to 10. Plot the loss against the learning rate and pick a rate somewhat below the point where the loss starts to climb.
- Optimizer: Ch 11
- Batch size: 32
- Activation function: ReLU for hidden layers, output layer depends on task
- # of iterations: Don't worry about it, use Early Stopping instead
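The learning rate sweep described in the list can be generated as an exponential schedule. The sketch below only builds the schedule; the function name `lr_sweep` is an assumption, and the endpoints and iteration count are taken from the note above. Training at each rate and plotting the loss is left out.

```python
import numpy as np

def lr_sweep(min_lr=1e-5, max_lr=10.0, iterations=300):
    """Hypothetical helper: learning rates growing by a constant factor each iteration."""
    factor = (max_lr / min_lr) ** (1.0 / (iterations - 1))
    lrs = min_lr * factor ** np.arange(iterations)
    return lrs, factor

lrs, factor = lr_sweep()
# lrs[0] is the low starting rate and lrs[-1] the final large one;
# each iteration multiplies the previous rate by `factor`.
```

Multiplying by a constant factor, rather than adding a constant step, spreads the iterations evenly across orders of magnitude, which suits a range as wide as \(10^{-5}\) to 10.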
Tips for training NN
- Regularization: add a penalty on large weights to the loss function
    - L1 (absolute values/lasso): drives the weights of irrelevant features to exactly 0
    - L2 (squares/ridge): shrinks the weights of irrelevant features smoothly toward 0
- Early stopping: stop training once the validation error stops improving, rather than fixing the number of epochs in advance
- Dropout:
    - At each training step, every neuron in a layer is temporarily dropped with probability p (the dropout rate), so each mini-batch trains a different thinned network
    - During training we drop out some neurons; during testing we bring them all back but multiply their weights by the keep probability (1 - p)
- Use a different activation function (try maxout)
- Use a different learning rate optimizer
- StandardScaler for numeric features
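The regularization penalties and dropout from the list above can be sketched as follows. The function names and the strength `lam` are assumptions. Note that the dropout shown is the "inverted" variant, which divides by the keep probability during training instead of multiplying the weights by (1 - p) at test time; the expected activations are the same, but it is a swapped-in formulation rather than the one written in the notes.

```python
import numpy as np

def l1_penalty(W, lam):
    """L1 term added to the loss: lam * sum of absolute weights."""
    return lam * np.abs(W).sum()

def l2_penalty(W, lam):
    """L2 term added to the loss: lam * sum of squared weights."""
    return lam * (W ** 2).sum()

def dropout(A, p, rng, training=True):
    """Inverted dropout on activations A with drop probability p (assumed variant)."""
    if not training or p == 0.0:
        return A                        # test time: all neurons are kept as-is
    keep = 1.0 - p
    mask = rng.random(A.shape) < keep   # True where the neuron survives this step
    return A * mask / keep              # rescale so the expected value is unchanged

W = np.array([[1.0, -2.0],
              [0.0,  3.0]])
l1 = l1_penalty(W, lam=0.1)             # 0.1 * (1 + 2 + 0 + 3)
l2 = l2_penalty(W, lam=0.1)             # 0.1 * (1 + 4 + 0 + 9)

rng = np.random.default_rng(0)
A = np.ones((4, 5))
A_train = dropout(A, p=0.5, rng=rng, training=True)
A_test = dropout(A, p=0.5, rng=rng, training=False)
```

With `p=0.5` and all-ones activations, every entry of `A_train` is either 0 (dropped) or 2 (kept and rescaled), while `A_test` is returned untouched.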
Why not just add more layers?
Sam
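- Vanishing gradient: gradients shrink as backpropagation moves from the output layer toward the input layer, so the first layers are updated far less than later layers (with sigmoid activations and naive initialization, the variance of the outputs grows from layer to layer until the later layers saturate)
- ReLU: does not saturate for positive inputs; its slope is 1 there, so the gradient passes through each active neuron unchanged from one layer to the next
    - Problem with ReLU: a neuron whose weighted sum is negative for every training instance outputs 0 and has a zero gradient, so it stops learning (a "dying" ReLU)
    - Alternative: Leaky ReLU uses a very small slope instead of 0 for negative inputs
    - Alternative: Parametric ReLU learns the slope of the "below 0" section during training
- Batch normalization: zero-centers and normalizes each layer's inputs over the mini-batch, then rescales and shifts them with learned parameters, so changes in the distribution of the previous layer's outputs affect the layer less
- Gradient clipping (for exploding gradients): set a threshold that gradients can't go above/below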
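A short NumPy sketch of the activation variants and of gradient clipping follows. The function names, the leaky slope `alpha=0.01`, and the clipping thresholds are assumptions for illustration. Clipping by norm is included as a second option because clipping each component by value can change the gradient's direction.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    return (z > 0).astype(float)        # 0 for negative inputs: this is where neurons can die

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)

def leaky_relu_grad(z, alpha=0.01):
    return np.where(z > 0, 1.0, alpha)  # small but nonzero slope below 0

def clip_by_value(g, threshold=1.0):
    """Clamp every gradient component to [-threshold, threshold]."""
    return np.clip(g, -threshold, threshold)

def clip_by_norm(g, max_norm=1.0):
    """Rescale the whole gradient if its L2 norm exceeds max_norm (direction preserved)."""
    norm = np.linalg.norm(g)
    return g if norm <= max_norm else g * (max_norm / norm)

z = np.array([-2.0, 0.5])
relu_out, relu_slope = relu(z), relu_grad(z)
leaky_out, leaky_slope = leaky_relu(z), leaky_relu_grad(z)

by_value = clip_by_value(np.array([0.9, 100.0]))   # direction changes noticeably
by_norm = clip_by_norm(np.array([3.0, 4.0]))       # direction preserved
```

In Parametric ReLU, `alpha` would be a trainable parameter updated by backpropagation instead of a fixed constant.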