Common Deep Learning Formulas (Work in Progress)

Wednesday, July 10, 2024

⚠️ This article is an original work by P3troL1er, first published at https://peterliuzhi.top/posts/%E6%B7%B1%E5%BA%A6%E5%AD%A6%E4%B9%A0%E5%B8%B8%E8%A7%81%E7%9A%84%E5%85%AC%E5%BC%8F%E6%9B%B4%E6%96%B0%E4%B8%AD/. For commercial reproduction, please contact the author for permission; for non-commercial reproduction, please credit the source!

I had the idea to put this summary together a while ago; I wrote it in English at the time and never bothered to translate it back.

Formula Cheatsheet

Layer Normalization

$$ \begin{aligned} \mu &= \frac{1}{H}\sum_{i=1}^{H}x_i \newline \sigma^2 &= \frac{1}{H}\sum_{i=1}^{H}(x_i-\mu)^2 \newline \hat{x}_i &= \frac{x_i - \mu}{\sqrt{\sigma^2+\epsilon}}\newline y_i &= \gamma \hat{x}_i + \beta \end{aligned} $$
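A minimal NumPy sketch of the formula above (the `eps` value and the input shapes are illustrative, not from the original post):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Layer normalization over the last (feature) axis of x."""
    mu = x.mean(axis=-1, keepdims=True)            # per-sample mean over H features
    var = x.var(axis=-1, keepdims=True)            # per-sample variance
    x_hat = (x - mu) / np.sqrt(var + eps)          # normalize each sample
    return gamma * x_hat + beta                    # learnable scale and shift

# usage: x has shape (batch, H); gamma and beta have shape (H,)
x = np.random.randn(4, 8)
y = layer_norm(x, gamma=np.ones(8), beta=np.zeros(8))
```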

Batch Normalization

$$ \begin{aligned} \mu_{\mathcal{B}} &= \frac{1}{m}\sum_{i=1}^{m}x_i \newline \sigma_{\mathcal{B}}^2 &= \frac{1}{m}\sum_{i=1}^{m}(x_i-\mu_{\mathcal{B}})^2 \newline \hat{x}_i &= \frac{x_i - \mu_{\mathcal{B}}}{\sqrt{\sigma_{\mathcal{B}}^2+\epsilon}}\newline y_i &= \gamma \hat{x}_i + \beta \end{aligned} $$
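A matching NumPy sketch using training-mode batch statistics only (the running averages used at inference time are omitted; shapes and `eps` are illustrative):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Batch normalization over the batch axis (training-mode statistics only)."""
    mu = x.mean(axis=0)                    # mean over the mini-batch of size m
    var = x.var(axis=0)                    # variance over the mini-batch
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalize each feature
    return gamma * x_hat + beta            # learnable scale and shift

# usage: x has shape (m, features)
x = np.random.randn(32, 8)
y = batch_norm(x, gamma=np.ones(8), beta=np.zeros(8))
```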

Some Criteria

Accuracy

$$ ACC = \frac{TP + TN}{TP + TN + FP + FN} $$

Precision

$$ Precision = \frac{TP}{TP + FP} $$

Recall

$$ Recall = \frac{TP}{TP + FN} $$

F1 Score

$$ F1 = 2 \cdot \frac{Precision \cdot Recall}{Precision + Recall} $$
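A small helper that computes all four metrics from confusion-matrix counts (the zero-division guards and the example numbers are my own additions):

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return acc, precision, recall, f1

print(classification_metrics(tp=8, tn=5, fp=2, fn=1))
```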

L1 regularization (Lasso Regression)

$$ L(\theta) = L_{origin}(\theta) + \lambda \sum_{j=1}^{n}\vert \theta_j \vert $$

L2 regularization (Ridge Regression)

$$ L(\theta) = L_{origin}(\theta) + \lambda \sum_{j=1}^{n}\theta_j^2 $$
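A quick sketch of adding either penalty to an existing loss value (the function name, `lam`, and the example numbers are illustrative):

```python
import numpy as np

def regularized_loss(base_loss, theta, lam, kind="l2"):
    """Add an L1 (Lasso) or L2 (Ridge) penalty to a base loss value."""
    if kind == "l1":
        penalty = lam * np.sum(np.abs(theta))   # lambda * sum |theta_j|
    else:
        penalty = lam * np.sum(theta ** 2)      # lambda * sum theta_j^2
    return base_loss + penalty

theta = np.array([0.5, -1.2, 3.0])
print(regularized_loss(base_loss=0.8, theta=theta, lam=0.01, kind="l1"))
```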

Bayes’ theorem

$$ P(A|B) = \frac{P(B|A)P(A)}{P(B)} $$

Naive Bayes Model

$$ \begin{aligned} P(C|X) &\propto P(C) \prod_{i=1}^{n}P(x_i|C) \newline \hat{C} & = \text{argmax}_C P(C|X) \end{aligned} $$
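A toy sketch of the argmax rule for categorical features, working in log space to avoid underflow (the priors and likelihood tables are made up for illustration):

```python
import numpy as np

# Toy categorical Naive Bayes: 2 classes, 2 binary features.
log_prior = np.log(np.array([0.6, 0.4]))                      # log P(C)
# log P(x_i = v | C), indexed as [class, feature, value]
log_lik = np.log(np.array([[[0.8, 0.2], [0.7, 0.3]],
                           [[0.3, 0.7], [0.4, 0.6]]]))

def predict(x):
    """argmax_C  log P(C) + sum_i log P(x_i | C)."""
    scores = log_prior + log_lik[:, np.arange(len(x)), x].sum(axis=1)
    return int(np.argmax(scores))

print(predict(np.array([1, 0])))
```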

Softmax

The softmax function transforms a vector of logits into a probability vector:

$$ \sigma(z_i) = \frac{e^{z_i}}{\sum_{j=1}^K e^{z_j}} $$

and its gradient can be computed as

$$ \frac{\partial \sigma(z_i)}{\partial z_j} = \left\{\begin{matrix} \sigma(z_i)(1 - \sigma(z_i)) & \text{when } i = j \newline -\sigma(z_i)\sigma(z_j) & \text{when } i \ne j \end{matrix}\right. $$

When $z_i \to +\infty$ (with the other logits fixed), $e^{z_i} \to +\infty$, so $\sigma(z_i) \to 1$ and $\sigma(z_j) \to 0$ for $j \ne i$. In this saturated regime both gradient terms vanish:

$$ \sigma(z_i)(1 - \sigma(z_i)) \to 0 \text{ when } i = j \newline -\sigma(z_i)\sigma(z_j) \to 0 \text{ when } i \ne j $$
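A NumPy sketch of a numerically stable softmax and its full Jacobian, matching the two cases above (`softmax_jacobian` is my own helper name):

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax: subtracting max(z) leaves the output unchanged."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def softmax_jacobian(z):
    """J[i, j] = d sigma(z_i) / d z_j = sigma_i * (delta_ij - sigma_j)."""
    s = softmax(z)
    return np.diag(s) - np.outer(s, s)

z = np.array([2.0, 1.0, 0.1])
print(softmax(z))
print(softmax_jacobian(z))
```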

Convolution

one dimension

$$ (f \ast g)(t) = \int_{-\infty}^{\infty} f(\tau)g(t-\tau)\,d\tau $$

two dimensions

$$ (f \ast g)(i, j) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f(i-m, j-n)g(m, n)\,dm\,dn $$
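The formulas above are the continuous definitions; in practice a discrete, finite version is used. A naive NumPy sketch of a "valid" 2D convolution (note that most deep learning libraries actually compute cross-correlation, i.e. without the kernel flip):

```python
import numpy as np

def conv2d(x, k):
    """Naive 'valid' 2D convolution: the kernel is flipped, unlike cross-correlation."""
    kh, kw = k.shape
    out_h, out_w = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    k_flipped = k[::-1, ::-1]                      # true convolution flips the kernel
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k_flipped)
    return out

x = np.arange(16, dtype=float).reshape(4, 4)
k = np.array([[1.0, 0.0], [0.0, -1.0]])
print(conv2d(x, k))
```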

gradient

For the input,

$$ \frac{\partial L}{\partial X} = \frac{\partial L}{\partial Y} \ast k_{rot} $$ where $k_{rot}(i, j) = k(-i, -j)$, i.e. the kernel rotated by $180^\circ$.

For the kernel,

$$ \frac{\partial L}{\partial k} = \frac{\partial L}{\partial Y} \ast X_{pad} $$ where $X_{pad}$ is $X$ with appropriate padding.

Activation functions

ReLU

$$ ReLU(x) = \max(0, x) $$

Leaky ReLU

$$ LeakyReLU(x) = \left\{\begin{matrix} x, & x > 0 \newline \alpha x, & x \le 0 \end{matrix}\right. $$ where $\alpha$ is a small constant, typically set to $0.01$

Sigmoid

$$ f(x) = \frac{1}{1 + e^{-x}} $$

Tanh

$$ \tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} $$
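All four activations in a few lines of NumPy (the test vector is arbitrary):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for f in (relu, leaky_relu, sigmoid, tanh):
    print(f.__name__, f(x))
```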

Loss functions

MSE

$$ MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2 $$

Cross-Entropy Loss

$$ CE = -\frac{1}{n}\sum_{i=1}^n[y_i\log(\hat{y}_i) + (1-y_i)\log(1-\hat{y}_i)] $$

Hinge Loss (For SVM)

$$ HingeLoss = \sum_{i=1}^n\max(0, 1-y_i \cdot \hat{y}_i), \quad y_i \in \{+1, -1\} $$
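Minimal NumPy versions of the three losses (the clipping `eps` and the sample arrays are illustrative; the hinge loss sums over the batch as in the formula above):

```python
import numpy as np

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)

def binary_cross_entropy(y, p, eps=1e-12):
    p = np.clip(p, eps, 1 - eps)                     # avoid log(0)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def hinge_loss(y, score):
    """y in {-1, +1}, score is the raw model output."""
    return np.sum(np.maximum(0.0, 1.0 - y * score))

print(mse(np.array([1.0, 2.0]), np.array([0.9, 2.2])))
print(binary_cross_entropy(np.array([1, 0, 1]), np.array([0.9, 0.2, 0.6])))
print(hinge_loss(np.array([1, -1, 1]), np.array([0.8, -0.3, 1.5])))
```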

KL divergence

$$ D_{KL}(P || Q) = -\mathbb{E}_{P(X)}[\log \frac{Q(X)}{P(X)}] \ge 0 $$

Its non-negativity can be proved through Gibbs' inequality; in fact, Gibbs' inequality and this non-negativity are essentially the same statement, and both follow from Jensen's inequality applied to the concave $\log$ function.
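A quick numerical check of the non-negativity for discrete distributions (the clipping `eps` and the example distributions are illustrative):

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """D_KL(P || Q) = sum_x P(x) log(P(x) / Q(x)) for discrete distributions."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return np.sum(p * np.log(p / q))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])
print(kl_divergence(p, q))      # >= 0, and 0 only when p == q
```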

SVM

Linear

Hard margin

$$ \min_{\mathbf{w}, b} \Vert \mathbf{w} \Vert_2^2 \newline \text{s.t. } y_i(\mathbf{w} \cdot x_i - b) \ge 1, \quad \forall i $$

Soft margin (for data that cannot be linearly separated)

With slack variables $\zeta_i \ge 0$ (whose optimal values equal the hinge loss),

$$ \min_{\mathbf{w}, b, \zeta} \Vert \mathbf{w} \Vert_2^2 + C\sum_{i=1}^n\zeta_i \newline \text{s.t. } y_i(\mathbf{w} \cdot x_i - b) \ge 1 - \zeta_i, \quad \zeta_i \ge 0 $$

Non-Linear (Kernel function trick)

With a kernel function, the SVM decision function becomes

$$ f(x) = \text{sign}\left(\sum_{i=1}^n\alpha_i y_i K(x_i, \mathbf{x}) + b\right) $$ where the $\alpha_i$ are Lagrange multipliers; $\alpha$ and $b$ are obtained by solving the dual problem

Polynomial (homogeneous)

$$ K(x_i, x_j) = (x_i \cdot x_j)^d $$

Polynomial (inhomogeneous)

$$ K(x_i, x_j) = (x_i \cdot x_j + r)^d $$

Gaussian radial basis function

$$ K(x_i, x_j) = \exp(-\gamma \Vert x_i - x_j \Vert^2) $$
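Sketches of the kernels plus the resulting decision function (parameter defaults such as `gamma=0.5` and `d=2` are illustrative, and `svm_decision` assumes the support vectors, `alpha`, and `b` are already known from the dual problem):

```python
import numpy as np

def poly_kernel(xi, xj, d=2, r=0.0):
    """(x_i . x_j + r)^d; r = 0 gives the homogeneous case."""
    return (np.dot(xi, xj) + r) ** d

def rbf_kernel(xi, xj, gamma=0.5):
    """exp(-gamma * ||x_i - x_j||^2)."""
    return np.exp(-gamma * np.sum((xi - xj) ** 2))

def svm_decision(x, support_x, support_y, alpha, b, kernel=rbf_kernel):
    """sign( sum_i alpha_i * y_i * K(x_i, x) + b )."""
    s = sum(a * y * kernel(sx, x) for a, y, sx in zip(alpha, support_y, support_x))
    return np.sign(s + b)

xi, xj = np.array([1.0, 2.0]), np.array([0.5, 1.5])
print(poly_kernel(xi, xj, d=2, r=1.0), rbf_kernel(xi, xj))
```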

Transformer

Self-Attention

$$ Attention(Q, K, V) = Softmax(\frac{QK^T}{\sqrt{d_k}}) V $$
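A plain-NumPy sketch of scaled dot-product attention for a single head, without masking (shapes are illustrative):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, applied row-wise."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V

Q = np.random.randn(5, 16); K = np.random.randn(5, 16); V = np.random.randn(5, 16)
print(scaled_dot_product_attention(Q, K, V).shape)   # (5, 16)
```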

Multihead Attention

$$ \begin{aligned} MultiHead(Q, K, V) &= concat(head_1, \ldots, head_h)W^O \newline \text{where } head_i &= Attention(QW_i^Q, KW_i^K, VW_i^V) \end{aligned} $$

Position Embedding

$$ PE(pos, 2i) = \sin(pos/10000^{2i/d_{model}}) \newline PE(pos, 2i+1) = \cos(pos/10000^{2i/d_{model}}) $$

and in practice the factor $1/10000^{2i/d_{model}}$ is usually computed in log space as

$$ \exp\left(-\frac{2i}{d_{model}}\log 10000\right) = 10000^{-2i/d_{model}} = (1/10000)^{2i/d_{model}} $$
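A sketch of the sinusoidal table computed with exactly this log-space trick (`max_len` and `d_model` are illustrative, and `d_model` is assumed even):

```python
import numpy as np

def sinusoidal_position_embedding(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(...)."""
    pos = np.arange(max_len)[:, None]                     # (max_len, 1)
    two_i = np.arange(0, d_model, 2)[None, :]             # even dimension indices 2i
    div = np.exp(-np.log(10000.0) * two_i / d_model)      # 10000^(-2i/d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(pos * div)
    pe[:, 1::2] = np.cos(pos * div)
    return pe

print(sinusoidal_position_embedding(max_len=50, d_model=16).shape)  # (50, 16)
```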

Position-wise Feed-Forward Networks

This is the fully connected layer used in the Transformer:

$$ FFN(x) = \max(0, xW_1 + b_1)W_2 + b_2 $$ and it can also be described as two convolutions with kernel size 1.
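A NumPy sketch of the position-wise FFN with ReLU, applied independently to each position (the dimensions are illustrative):

```python
import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied to every position independently."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

d_model, d_ff, seq_len = 16, 64, 5
x = np.random.randn(seq_len, d_model)
W1, b1 = np.random.randn(d_model, d_ff), np.zeros(d_ff)
W2, b2 = np.random.randn(d_ff, d_model), np.zeros(d_model)
print(position_wise_ffn(x, W1, b1, W2, b2).shape)   # (5, 16)
```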

GAN

$$ \min_G\max_D V(D, G) = \mathbb{E}_{\mathbf{x} \sim p_{data}(\mathbf{x})}[\log D(\mathbf{x})] + \mathbb{E}_{\mathbf{z} \sim p_{z}(\mathbf{z})}[\log (1 - D(G(\mathbf{z})))] $$

Note that here we use cross-entropy loss.

See more at https://jaketae.github.io/study/gan-math/
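A tiny sketch that just evaluates the value function $V(D, G)$ from a batch of discriminator outputs (the probabilities and the clipping `eps` are made up for illustration; this is not a training loop):

```python
import numpy as np

def gan_value(d_real, d_fake, eps=1e-12):
    """V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))], estimated from a batch."""
    d_real = np.clip(d_real, eps, 1 - eps)
    d_fake = np.clip(d_fake, eps, 1 - eps)
    return np.mean(np.log(d_real)) + np.mean(np.log(1 - d_fake))

d_real = np.array([0.9, 0.8, 0.95])   # discriminator outputs on real samples
d_fake = np.array([0.2, 0.1, 0.3])    # discriminator outputs on generated samples
print(gan_value(d_real, d_fake))
```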

Diffusion Model

See more at https://www.assemblyai.com/blog/diffusion-models-for-machine-learning-introduction/

Forward Process

$$ q(x_t|x_{t-1}) = \mathcal{N}(x_t;\sqrt{1-\beta_t}x_{t-1}, \beta_t\mathbf{I}) $$
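A one-step sketch of the forward (noising) transition $q(x_t|x_{t-1})$ (the sample shape and $\beta_t$ are illustrative):

```python
import numpy as np

def forward_diffusion_step(x_prev, beta_t):
    """Sample x_t ~ N( sqrt(1 - beta_t) * x_{t-1}, beta_t * I )."""
    noise = np.random.randn(*x_prev.shape)
    return np.sqrt(1.0 - beta_t) * x_prev + np.sqrt(beta_t) * noise

x0 = np.random.randn(8)       # a toy "clean" sample
x1 = forward_diffusion_step(x0, beta_t=0.02)
print(x1)
```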

Reverse Process

$$ p_\theta(x_{t-1}|x_t) = \mathcal{N}(x_{t-1};\mu_\theta(x_t, t), \Sigma_\theta(x_t, t)) $$

Loss (VLB)

$$ L = \mathbb{E}_{q(x_{0:T})}\left[\log{\frac{q(x_{1:T}|x_0)}{p_{\theta}(x_{0:T})}}\right] $$