Lecture 4
Deep Learning Fundamentals
Softmax Classifier
Multinomial Logistic Regression.
Want to interpret raw classifier scores as probabilities.
Softmax function: $P(Y = k \mid X = x_i) = \frac{e^{s_k}}{\sum_j e^{s_j}}$
Cross Entropy: $H(P, Q) = -\sum_x P(x) \log Q(x)$, P = correct probs (one-hot), Q = normalized probabilities.
Put it all together: $L_i = -\log \frac{e^{s_{y_i}}}{\sum_j e^{s_j}}$
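A minimal sketch of the softmax loss for a single example in NumPy; `scores` and the correct-class index `y_i` are hypothetical inputs:

```python
import numpy as np

def softmax_loss(scores, y_i):
    # Shift scores by the max for numerical stability (does not change the result)
    shifted = scores - np.max(scores)
    probs = np.exp(shifted) / np.sum(np.exp(shifted))  # normalized probabilities Q
    return -np.log(probs[y_i])                         # cross-entropy against one-hot P

scores = np.array([3.2, 5.1, -1.7])   # raw classifier scores for 3 classes
print(softmax_loss(scores, y_i=0))    # loss when the correct class is index 0
```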
Optimization
Random search, or follow the slope.
Gradient is the vector of partial derivatives along each dimension - steepest descent is negative gradient.
Loss gradient: denoted $\nabla_W L$ or $\partial L / \partial W$.
Gradient descent: update weights iteratively, moving in the direction opposite to the gradient.
SGD: the full sum over all examples is expensive, so use a minibatch. Common minibatch sizes: 32, 64, 128.
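A minimal sketch of the minibatch SGD loop; `grad_fn` is a hypothetical function returning the loss gradient on a batch:

```python
import numpy as np

def sgd(W, X, y, grad_fn, lr=1e-3, batch_size=64, num_steps=1000):
    # grad_fn(W, X_batch, y_batch) -> dW is assumed to be defined elsewhere
    N = X.shape[0]
    for _ in range(num_steps):
        idx = np.random.choice(N, batch_size, replace=False)  # sample a minibatch
        dW = grad_fn(W, X[idx], y[idx])                       # gradient on the minibatch only
        W = W - lr * dW                                       # step opposite the gradient
    return W
```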
Problem with Linear Classifier
If data is not linearly separable.
One solution: feature transforms. E.g. data arranged in circles -> transform to polar coordinates, where it becomes linearly separable.
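A sketch of that circle example: a Cartesian-to-polar feature transform (function name hypothetical):

```python
import numpy as np

def to_polar(X):
    # X: (N, 2) array of (x, y) points -> (N, 2) array of (r, theta)
    r = np.sqrt(X[:, 0]**2 + X[:, 1]**2)
    theta = np.arctan2(X[:, 1], X[:, 0])
    return np.stack([r, theta], axis=1)

# Points on two concentric circles are not linearly separable in (x, y),
# but are separable by a simple threshold on r after the transform.
```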
Neural Networks
2-layer network: $f = W_2 \max(0, W_1 x)$
Activation function - ReLU (for non-linearity)
How to compute gradient?
Backpropagation: represent the function as a computational graph and apply the chain rule backwards through it.
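A minimal sketch of a 2-layer ReLU network: forward pass plus manual backprop via the chain rule (shapes and names hypothetical):

```python
import numpy as np

def two_layer_net(x, W1, W2, dscores):
    # Forward: f = W2 @ max(0, W1 @ x)
    h = np.maximum(0, W1 @ x)        # hidden layer with ReLU
    scores = W2 @ h
    # Backward: chain rule, given upstream gradient dscores = dL/dscores
    dW2 = np.outer(dscores, h)
    dh = W2.T @ dscores
    dh[h <= 0] = 0                   # ReLU gate: gradient flows only where h > 0
    dW1 = np.outer(dh, x)
    return scores, dW1, dW2

x = np.random.randn(3072)                                      # flattened image
W1, W2 = np.random.randn(100, 3072), np.random.randn(10, 100)  # hypothetical sizes
scores, dW1, dW2 = two_layer_net(x, W1, W2, np.random.randn(10))
```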
CNN
Fully connected layer
E.g. stretch a 32x32x3 image into a 3072x1 vector.
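As a sketch, that fully connected layer on a flattened image, assuming 10 output classes:

```python
import numpy as np

x = np.random.randn(32, 32, 3)   # input image
x_flat = x.reshape(3072)         # stretched into a 3072-vector
W = np.random.randn(10, 3072)    # hypothetical weight matrix for 10 classes
scores = W @ x_flat              # one dot product per output neuron
```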
Convolution layer: preserves spatial structure.
Convolve the filter with the image, i.e. slide it over the image spatially, computing dot products. Filters always extend through the full depth of the input.
Each number in the activation map is the dot product between the filter and a 5x5x3 chunk of the image (a 75-dimensional dot product plus bias) -> a 28x28x1 activation map.
If we have 6 such filters, we stack the activation maps and get a new "image" of size 28x28x6.
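A naive loop-based sketch of the convolution above (32x32x3 input, six 5x5x3 filters, stride 1, no padding), written for clarity rather than speed:

```python
import numpy as np

def conv_forward(x, filters, biases):
    # x: (H, W, C) image; filters: (K, F, F, C); biases: (K,); stride 1, no padding
    H, W, C = x.shape
    K, F = filters.shape[0], filters.shape[1]
    out = np.zeros((H - F + 1, W - F + 1, K))
    for k in range(K):
        for i in range(H - F + 1):
            for j in range(W - F + 1):
                chunk = x[i:i + F, j:j + F, :]                    # FxFxC chunk of the image
                out[i, j, k] = np.sum(chunk * filters[k]) + biases[k]
    return out

x = np.random.randn(32, 32, 3)
out = conv_forward(x, np.random.randn(6, 5, 5, 3), np.zeros(6))
print(out.shape)   # (28, 28, 6)
```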
ConvNet is a sequence of Convolution layers interspersed with activation functions.
In practice: common to zero pad the border.
Output dimension: (N + 2P - F) / stride + 1 (per spatial side)
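The same formula as a small helper, assuming a square input and filter:

```python
def conv_output_size(N, F, stride=1, pad=0):
    # (N + 2P - F) / stride + 1, per spatial side
    return (N + 2 * pad - F) // stride + 1

print(conv_output_size(32, 5))          # 28
print(conv_output_size(32, 5, pad=2))   # 32 ("same" padding)
```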
Pooling layer
Makes the representation smaller and more manageable.
Operates over each activation map independently.
E.g. max pooling (take the max in each window).
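A sketch of 2x2 max pooling with stride 2 on one activation map, assuming dimensions divisible by 2:

```python
import numpy as np

def max_pool_2x2(a):
    # a: (H, W) activation map with H and W divisible by 2; 2x2 windows, stride 2
    H, W = a.shape
    return a.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

a = np.arange(16).reshape(4, 4)
print(max_pool_2x2(a))   # [[ 5  7] [13 15]] -- the max of each 2x2 window
```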
Training NN
Activation Functions
In practice: use ReLU, don't use sigmoid or tanh.
Data preprocessing:
In practice: center only.
Subtract the mean image
Subtract the per-channel mean
Subtract the per-channel mean and divide by the per-channel std
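A sketch of the per-channel preprocessing options on a hypothetical batch of images:

```python
import numpy as np

X = np.random.rand(50, 32, 32, 3)      # hypothetical batch of RGB images
mean = X.mean(axis=(0, 1, 2))          # per-channel mean, shape (3,)
std = X.std(axis=(0, 1, 2))            # per-channel std, shape (3,)
X_centered = X - mean                  # "center only"
X_normalized = (X - mean) / std        # center and scale
```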
Weight Initialization:
Xavier initialization (sigmoid) / He initialization (ReLU)
Too small -> activations go to zero, gradients also go to zero.
Too big -> activations saturate, gradients go to zero.
Just right -> activations nicely scaled at every layer.
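Xavier and He initialization as a sketch; `fan_in` is the number of inputs to the layer:

```python
import numpy as np

def xavier_init(fan_in, fan_out):
    # Xavier: variance 1/fan_in, suited to sigmoid/tanh
    return np.random.randn(fan_out, fan_in) / np.sqrt(fan_in)

def he_init(fan_in, fan_out):
    # He: variance 2/fan_in, compensates for ReLU zeroing half the activations
    return np.random.randn(fan_out, fan_in) * np.sqrt(2.0 / fan_in)
```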
Fancier optimizer: Adam.
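A sketch of the Adam update rule with its standard hyperparameters (all names hypothetical):

```python
import numpy as np

def adam_step(w, dw, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # Moving averages of the gradient and its square, with bias correction
    m = beta1 * m + (1 - beta1) * dw
    v = beta2 * v + (1 - beta2) * dw * dw
    m_hat = m / (1 - beta1 ** t)       # t is the 1-indexed step number
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```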