Notes on CS231N

Last Updated: 2024-06-14 22:49:28 Friday

-- TOC --

An artificial neural network (nn) is layered, so that it can be programmed effectively and learn from training data. The word "deep" simply means that the nn has many layers. This deep layered structure is really a mathematical model or function, which has an input and an output/loss. The training goal is to lower the loss so that the model fits the training data better. Finally, the trained model can be used to predict on inputs it has never seen before.

$$\cfrac{1}{N}\sum_i L(f(x_i,w,b),y_i)+\lambda\cdot R(w)=loss$$

L and f are both designed and chosen, x and y are the input data pairs, and w and b need to be learned. Normally the total loss includes a w-based regularization term R(w). N is the number of training examples. \(\lambda\) is a hyperparameter.
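
Below is a minimal numpy sketch of this objective, assuming L2 regularization and a generic per-batch data loss (the names scores, data_loss_fn and lam are illustrative, not from the original notes):

    import numpy as np

    def total_loss(scores, y, w, lam, data_loss_fn):
        # scores: (N, C) outputs f(x, w, b) for N examples and C classes
        # y: (N,) correct class indices
        # data_loss_fn: data loss summed over the batch, e.g. SVM or softmax loss
        N = scores.shape[0]
        data_loss = data_loss_fn(scores, y) / N   # (1/N) * sum_i L(...)
        reg_loss = lam * np.sum(w * w)            # lambda * R(w), here L2
        return data_loss + reg_loss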

Linear Layer

The simplest layer is the linear (fully-connected) layer. Every input feature in x has a connection to every node in this layer. Every node performs an affine transformation with its own w and b.

$$z_{in}w+b=z_{out}$$
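
A rough numpy sketch of one linear layer's forward pass (the batch-first shapes are an assumption for illustration):

    import numpy as np

    def linear_forward(z_in, w, b):
        # z_in: (N, D_in) batch of inputs
        # w: (D_in, D_out) weights, b: (D_out,) biases
        # every input feature is connected to every node in the layer
        return z_in @ w + b   # (N, D_out)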

Activation

Between consecutive linear layers there should be non-linear transformations, otherwise the stacked linear layers would collapse into one single linear transformation, which limits the nn's capability. We call them activation layers. The name comes from the activation function. There are many different activation functions available, such as sigmoid, tanh, relu, etc.

(figure: common activation functions)
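
Sketches of a few common activation functions in numpy, applied elementwise:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))   # squashes to (0, 1)

    def tanh(z):
        return np.tanh(z)                 # squashes to (-1, 1)

    def relu(z):
        return np.maximum(0, z)           # zero for negative inputs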

SGD and Minibatch

In practice, Stochastic Gradient Descent usually means the minibatch method. Why does this stochastic method work? The reason it works well is that the examples in the training data are correlated, so the gradient over all training data can be approximated by the gradient over a randomly chosen minibatch.
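
A minimal sketch of a minibatch SGD loop, assuming a hypothetical helper loss_and_grad(w, X_batch, Y_batch) that returns the loss and the gradient dw:

    import numpy as np

    def sgd(w, X, Y, loss_and_grad, lr=1e-3, batch_size=256, steps=1000):
        N = X.shape[0]
        for _ in range(steps):
            idx = np.random.choice(N, batch_size, replace=False)  # random minibatch
            loss, dw = loss_and_grad(w, X[idx], Y[idx])           # approximates the full gradient
            w -= lr * dw                                          # descent step
        return w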

Loss

We use the loss to quantify how good our model is on the training data. The lower the better. But things are not that simple, because of the danger of overfitting.

SVM Loss

$$L=\sum_i \sum_{j\neq y}\max(0,\,z_{ij}-z_{iy}+1)$$

z is the output of the last linear layer. \(z_{iy}\) is the output value for the correct class of the i-th training example.

SVM loss can be exactly zero once every margin is satisfied.
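
A vectorized numpy sketch of this SVM loss over a batch of scores z with shape (N, C) and labels y with shape (N,):

    import numpy as np

    def svm_loss(z, y):
        N = z.shape[0]
        correct = z[np.arange(N), y][:, None]      # z_iy for each example
        margins = np.maximum(0, z - correct + 1)   # max(0, z_ij - z_iy + 1)
        margins[np.arange(N), y] = 0               # drop the j == y terms
        return margins.sum()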

Softmax Loss

$$L=\sum_i -\log{\left(\cfrac{e^{z_{iy}}}{\sum_j e^{z_{ij}}}\right)}$$

(figure: softmax)
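
A numpy sketch of the softmax loss above, shifting the scores before exponentiation for numerical stability (the shift does not change the result):

    import numpy as np

    def softmax_loss(z, y):
        # z: (N, C) class scores, y: (N,) correct class indices
        N = z.shape[0]
        z = z - z.max(axis=1, keepdims=True)
        log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(N), y].sum()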

Regularization Penalty

Regularization encodes a preference for a certain set of weights, and it also provides many desirable properties. The most common is L2 regularization:

$$R(w)=\sum_{i}w_i^2$$

Convolution Layer

A convolution layer uses filters to preserve spatial characteristics; it is not a fully-connected nn layer. We have to choose the filters' size, stride, number, and whether to use zero-padding (the most common padding). The output of a Conv. layer is a 3D structure. For instance, a 32x32x3 input with 7 filters of size 5x5x3 and stride 1 gives a 28x28x7 output. A 1x1 filter can also be used, which means looking at a single pixel across all of its channels (e.g. one RGB pixel).
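
A small sketch of the output-size arithmetic for that example (32x32x3 input, 7 filters of 5x5x3, stride 1, no padding):

    def conv_output_shape(H, W, filter_size, num_filters, stride=1, pad=0):
        # spatial output size: (dim - filter_size + 2*pad) // stride + 1
        H_out = (H - filter_size + 2 * pad) // stride + 1
        W_out = (W - filter_size + 2 * pad) // stride + 1
        return H_out, W_out, num_filters

    print(conv_output_shape(32, 32, filter_size=5, num_filters=7))  # (28, 28, 7)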

Pool Layer

There are max pooling and average pooling. They only do spatial downsampling and keep the depth unchanged. A stride greater than 1 in a Conv. layer can achieve a downsampling effect similar to a pooling layer.
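
A naive numpy sketch of 2x2 max pooling with stride 2 (spatial downsampling only, depth unchanged):

    import numpy as np

    def max_pool_2x2(x):
        # x: (H, W, C) with even H and W; output: (H//2, W//2, C)
        H, W, C = x.shape
        windows = x.reshape(H // 2, 2, W // 2, 2, C)
        return windows.max(axis=(1, 3))   # max over each 2x2 window, per channel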

Papers

Link to this article: https://cs.pynote.net/ag/ml/ann/202405319/

-- EOF --
