# Lecture 4

## Deep Learning Fundamentals

### Softmax Classifier

Multinomial Logistic Regression.

Want to interpret the raw classifier scores as probabilities.

$$s = f(x\_i; W)$$ , $$P(Y = k | X = x\_i) = \frac{e^{s\_k}}{\sum\_j e^{s\_j}}$$

$$L\_i = - \log P(Y = y\_i | X = x\_i)$$

Cross Entropy: $$H(P, Q) = H(P) + D\_{KL}(P||Q)$$, where P = correct (target) probabilities, Q = normalized (predicted) probabilities.

Putting it all together: $$L\_i = -\log \left( \frac{e^{s\_{y\_i}}}{\sum\_j e^{s\_j}} \right)$$
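
A minimal numpy sketch of the softmax probabilities and the per-example loss; the scores and label below are made up for illustration:

```python
import numpy as np

# Raw scores s = f(x_i; W) for one example over 3 classes (made-up numbers).
s = np.array([3.2, 5.1, -1.7])
y_i = 0  # assume the correct class is the first one

# Softmax: shift by the max score for numerical stability (result is unchanged).
exp_s = np.exp(s - s.max())
probs = exp_s / exp_s.sum()

# Cross-entropy loss L_i = -log P(Y = y_i | X = x_i)
L_i = -np.log(probs[y_i])
print(probs, L_i)
```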

### Optimization

Random search, or follow the slope.

The gradient is the vector of partial derivatives along each dimension; the direction of steepest descent is the negative gradient.

Loss gradient: denoted $$\frac{\partial E\_n}{\partial w\_{ji}^{(1)}}$$ or $$\nabla\_w L$$

Gradient descent: update the weights iteratively, moving in the direction opposite to the gradient.

$$w^{\tau+1} = w^\tau - \eta \nabla E(w^\tau)$$

SGD: the full sum over all examples is expensive, so approximate it with a minibatch. Common minibatch sizes: 32, 64, 128.

$$\nabla\_W L(W) = \frac{1}{N} \sum\_{i=1}^N \nabla\_W L\_i(x\_i, y\_i, W) + \lambda \nabla\_W R(W)$$
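
A rough sketch of one minibatch SGD step; `grad_fn`, `lr`, and `reg` are placeholder names (not from the lecture) for a per-example gradient routine, the learning rate, and the regularization strength:

```python
import numpy as np

def sgd_step(W, X_batch, y_batch, grad_fn, lr=1e-3, reg=1e-5):
    """One SGD update: average per-example gradients over the minibatch,
    add the regularizer's gradient, then step against the gradient."""
    grad = np.zeros_like(W)
    for x_i, y_i in zip(X_batch, y_batch):
        grad += grad_fn(x_i, y_i, W)   # gradient of L_i w.r.t. W
    grad /= len(X_batch)
    grad += reg * 2 * W                # gradient of an assumed L2 regularizer lambda * ||W||^2
    return W - lr * grad               # w^{tau+1} = w^tau - eta * grad
```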

**Problem with Linear Classifier**

A linear classifier fails if the data is not linearly separable.

One solution: feature transforms. E.g., data arranged in concentric circles -> transform to polar coordinates.

$$f(x,y) = (r(x,y), \theta(x,y))$$
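
A small sketch of such a transform (the helper name is hypothetical, not from the lecture):

```python
import numpy as np

def polar_features(x, y):
    """Feature transform: Cartesian (x, y) -> (r, theta).
    Points on concentric circles become linearly separable along the r axis."""
    r = np.sqrt(x**2 + y**2)
    theta = np.arctan2(y, x)
    return r, theta
```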

### Neural Networks

<figure><img src="/files/xCQTZpqJk6EjXRV7oNzG" alt=""><figcaption><p>NN</p></figcaption></figure>

Activation function - ReLU (for non-linearity).

How to compute gradient?

<figure><img src="/files/fDE9D9RDgHNXgn83V6ub" alt=""><figcaption><p>back prop</p></figcaption></figure>
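
A hand-written forward and backward pass for a tiny two-layer ReLU network, as a sketch of what backprop computes; all shapes and values are illustrative:

```python
import numpy as np

# Tiny 2-layer net: h = ReLU(x W1), s = h W2.
x = np.random.randn(1, 4)          # input
W1 = np.random.randn(4, 8) * 0.1   # first layer weights
W2 = np.random.randn(8, 3) * 0.1   # second layer weights

# Forward pass
z = x @ W1
h = np.maximum(0, z)               # ReLU activation
s = h @ W2                         # class scores

# Backward pass (backprop), starting from some upstream gradient dL/ds.
ds = np.ones_like(s)               # placeholder upstream gradient
dW2 = h.T @ ds
dh = ds @ W2.T
dz = dh * (z > 0)                  # ReLU gate: pass gradient only where z > 0
dW1 = x.T @ dz
```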

## CNN

Fully connected layer

E.g., stretch a 32x32x3 image into a 3072x1 vector.

Convolution layer: preserve spatial structure.

Convolve the filter with the image, i.e., slide it over the image spatially, computing dot products. The filter always extends through the full depth of the input.

Each output number is the dot product between the filter and a 5x5x3 chunk of the image -> $$w^T x + b$$

With 6 such filters, we stack the activation maps to get a new 28x28x6 volume.
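
A quick shape check in PyTorch, assuming stride 1 and no padding (which the lecture's numbers imply):

```python
import torch
import torch.nn as nn

# 6 filters of size 5x5 over a 32x32x3 image.
conv = nn.Conv2d(in_channels=3, out_channels=6, kernel_size=5)
img = torch.randn(1, 3, 32, 32)    # batch of one CIFAR-sized image
out = conv(img)
print(out.shape)                   # torch.Size([1, 6, 28, 28])
```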

ConvNet is a sequence of Convolution layers interspersed with activation functions.

In practice: common to zero pad the border.

Output dimension (each side): (N + 2P - F) / stride + 1
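
The same formula as a small sketch, with a couple of worked cases:

```python
def conv_output_size(N, F, P, stride):
    """Spatial output size along one side: (N + 2P - F) / stride + 1."""
    return (N + 2 * P - F) // stride + 1

conv_output_size(32, 5, 0, 1)  # -> 28
conv_output_size(32, 5, 2, 1)  # -> 32 (zero padding preserves the size)
```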

**Pooling layer**

make representation smaller and more manageable.

operates over each activation map independently.

E.g max pooling (take max)
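
A shape-only sketch of 2x2 max pooling with stride 2 in PyTorch:

```python
import torch
import torch.nn as nn

# 2x2 max pooling with stride 2 halves the spatial size; depth is unchanged.
pool = nn.MaxPool2d(kernel_size=2, stride=2)
x = torch.randn(1, 6, 28, 28)
print(pool(x).shape)               # torch.Size([1, 6, 14, 14])
```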

## Training NN

**Activation Functions**

In practice: use ReLU, don't use sigmoid or tanh.

Data preprocessing:

In practice: center only.

* Subtract mean image
* Subtract per channel mean
* Subtract per channel mean and divide by per channel std.
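
A numpy sketch of the three options; the data array here is random and only stands in for a real image batch:

```python
import numpy as np

# X: N images of shape 32x32x3; random values stand in for real data.
X = np.random.rand(100, 32, 32, 3)

mean_image = X.mean(axis=0)              # 32x32x3 mean image
X1 = X - mean_image                      # subtract mean image

channel_mean = X.mean(axis=(0, 1, 2))    # per-channel mean, shape (3,)
channel_std = X.std(axis=(0, 1, 2))      # per-channel std, shape (3,)
X2 = X - channel_mean                    # subtract per-channel mean
X3 = (X - channel_mean) / channel_std    # ... and divide by per-channel std
```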

Weight Initialization:

Xavier (Sigmoid) / He initialization (ReLU)

* Too small -> activations go zero, gradients also 0
* Too big -> saturate, gradients 0
* Just right
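
A sketch of the two schemes for a fully connected layer; this uses one common form of Xavier (variance 1/fan_in), and exact scale factors vary by source:

```python
import numpy as np

def xavier_init(fan_in, fan_out):
    # Xavier: variance ~1/fan_in, suited to tanh/sigmoid activations.
    return np.random.randn(fan_in, fan_out) / np.sqrt(fan_in)

def he_init(fan_in, fan_out):
    # He: variance 2/fan_in, compensating for ReLU zeroing half the activations.
    return np.random.randn(fan_in, fan_out) * np.sqrt(2.0 / fan_in)
```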

Fancier optimizer: Adam.
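
A typical way to use Adam in PyTorch, sketched with an arbitrary small model (the architecture and hyperparameters are illustrative, not from the lecture):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(3072, 100), nn.ReLU(), nn.Linear(100, 10))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# One training step on a random minibatch of 32 examples.
x, y = torch.randn(32, 3072), torch.randint(0, 10, (32,))
loss = nn.CrossEntropyLoss()(model(x), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```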

