Common Deep Learning Formulas (Ongoing Updates)
Wednesday, July 10, 2024
This article is about 925 words
A 2-minute read
⚠️ This article is original work by P3troL1er and was first published at https://peterliuzhi.top/posts/%E6%B7%B1%E5%BA%A6%E5%AD%A6%E4%B9%A0%E5%B8%B8%E8%A7%81%E7%9A%84%E5%85%AC%E5%BC%8F%E6%9B%B4%E6%96%B0%E4%B8%AD/. For commercial reproduction, please contact the author for authorization; for non-commercial reproduction, please credit the source!
A while ago I had the sudden idea to put this summary together. I wrote it in English at the time and never got around to translating it back.
Layer Normalization
$$
\begin{aligned}
\mu &= \frac{1}{H}\sum_{i=1}^{H}x_i \newline
\sigma^2 &= \frac{1}{H}\sum_{i=1}^{H}(x_i-\mu)^2 \newline
\hat{x}_i &= \frac{x_i - \mu}{\sqrt{\sigma^2+\epsilon}}\newline
y_i &= \gamma \hat{x}_i + \beta
\end{aligned}
$$
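A minimal NumPy sketch of the formula, assuming `x` is a single feature vector of length $H$ and `gamma`, `beta` are the learnable per-feature scale and shift (names are illustrative):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # x: feature vector of length H; gamma, beta: per-feature scale and shift
    mu = x.mean()                        # mean over the feature dimension
    var = x.var()                        # biased variance, matching the formula
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

x = np.random.randn(8)
y = layer_norm(x, gamma=np.ones(8), beta=np.zeros(8))
```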
Batch Normalization
$$
\begin{aligned}
\mu_{\mathcal{B}} &= \frac{1}{m}\sum_{i=1}^{m}x_i \newline
\sigma_{\mathcal{B}}^2 &= \frac{1}{m}\sum_{i=1}^{m}(x_i-\mu_{\mathcal{B}})^2 \newline
\hat{x}_i &= \frac{x_i - \mu_{\mathcal{B}}}{\sqrt{\sigma_{\mathcal{B}}^2+\epsilon}}\newline
y_i &= \gamma \hat{x}_i + \beta
\end{aligned}
$$
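An analogous training-time sketch over a mini-batch of $m$ samples, with statistics taken per feature over the batch axis (running statistics and inference mode are omitted):

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    # x: (m, d) mini-batch; statistics are computed over the batch axis
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

y = batch_norm_train(np.random.randn(32, 8), gamma=np.ones(8), beta=np.zeros(8))
```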
Some Criteria
Accuracy
$$
ACC = \frac{TP + TN}{TP + TN + FP + FN}
$$
Precision
$$
Precision = \frac{TP}{TP + FP}
$$
Recall
$$
Recall = \frac{TP}{TP + FN}
$$
F1 Score
$$
F1 = 2 \cdot \frac{Precision \cdot Recall}{Precision + Recall}
$$
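The four metrics computed from raw confusion-matrix counts (a minimal sketch; guards against division by zero are omitted):

```python
def classification_metrics(tp, tn, fp, fn):
    accuracy  = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall    = tp / (tp + fn)
    f1        = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

print(classification_metrics(tp=40, tn=45, fp=5, fn=10))
```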
L1 regularization (Lasso Regression)
$$
L(\theta) = L_{origin}(\theta) + \lambda \sum_{j=1}^{n}\vert \theta_j \vert
$$
L2 regularization (Ridge Regression)
$$
L(\theta) = L_{origin}(\theta) + \lambda \sum_{j=1}^{n}\theta_j^2
$$
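A sketch of attaching either penalty to an existing loss value (the function and argument names are illustrative, not from a specific library):

```python
import numpy as np

def regularized_loss(base_loss, theta, lam, kind="l2"):
    # base_loss: scalar L_origin(theta); theta: parameter vector; lam: lambda
    if kind == "l1":
        penalty = lam * np.sum(np.abs(theta))   # Lasso
    else:
        penalty = lam * np.sum(theta ** 2)      # Ridge
    return base_loss + penalty
```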
Bayes’ theorem
$$
P(A|B) = \frac{P(B|A)P(A)}{P(B)}
$$
Naive Bayes Model
$$
\begin{aligned}
P(C|X) &\propto P(C) \prod_{i=1}^{n}P(x_i|C) \newline
\hat{C} & = \text{argmax}_C P(C|X)
\end{aligned}
$$
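A sketch of the arg-max rule in log space, assuming the prior and per-feature likelihood tables have already been estimated from data (their layout here is purely illustrative):

```python
import numpy as np

def naive_bayes_predict(log_prior, log_likelihood, x):
    # log_prior[c]            : log P(C = c)
    # log_likelihood[c][i][v] : log P(x_i = v | C = c)
    # x                       : observed feature values, one index per feature
    scores = [
        log_prior[c] + sum(log_likelihood[c][i][v] for i, v in enumerate(x))
        for c in range(len(log_prior))
    ]
    return int(np.argmax(scores))   # hat{C} = argmax_C P(C | X)
```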
Softmax
The softmax function transforms a vector of logits into a probability vector:
$$
\sigma(z_i) = \frac{e^{z_i}}{\sum_{j=1}^K e^{z_j}}
$$
and its gradient can be computed as
$$
\frac{\partial \sigma(z_i)}{\partial z_j} = \begin{cases}
\sigma(z_i)(1 - \sigma(z_i)) & \text{when } i = j \newline
-\sigma(z_i)\sigma(z_j) & \text{when } i \ne j
\end{cases}
$$
When $z_i \to +\infty$ with the other logits fixed, $e^{z_i} \to +\infty$, so $\sigma(z_i) \to 1$ and $\sigma(z_j) \to 0$ for $j \ne i$ (symmetrically, as $z_i \to -\infty$, $\sigma(z_i) \to 0$). In either case the gradient vanishes:
$$
\sigma(z_i)(1 - \sigma(z_i)) \to 0 \text{ when } i = j \newline
-\sigma(z_i)\sigma(z_j) \to 0 \text{ when } i \ne j
$$
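A numerically stable softmax together with its full Jacobian, matching the two cases above (a sketch; the max-shift is a standard stability trick, not part of the definition):

```python
import numpy as np

def softmax(z):
    z = z - z.max()                  # shift for stability; the result is unchanged
    e = np.exp(z)
    return e / e.sum()

def softmax_jacobian(z):
    s = softmax(z)
    # J[i, j] = s_i * (delta_ij - s_j): diagonal case i == j, off-diagonal case i != j
    return np.diag(s) - np.outer(s, s)

z = np.array([2.0, 1.0, 0.1])
print(softmax(z))
print(softmax_jacobian(z))
```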
Convolution
one dimension
$$
(f \ast g)(t) = \int_{-\infty}^{\infty} f(\tau)g(t-\tau)\,d\tau
$$
two dimensions
$$
(f \ast g)(i, j) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f(i-m, j-n)g(m, n)\,dm\,dn
$$
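In deep learning the integrals become finite sums over the kernel support. Below is a sketch of a "valid" 2-D convolution with an explicitly flipped kernel, loops kept for clarity (no padding or stride):

```python
import numpy as np

def conv2d_valid(x, k):
    # true convolution: flip the kernel, matching the f(i-m, j-n) g(m, n) form,
    # then slide it over the input and take elementwise products
    k = np.flip(k)
    H, W = x.shape
    kh, kw = k.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

print(conv2d_valid(np.random.randn(5, 5), np.random.randn(3, 3)).shape)  # (3, 3)
```

Note that most deep learning frameworks actually implement cross-correlation (no kernel flip) and still call it convolution.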
gradient
For input,
$$
\frac{\partial L}{\partial X} = \frac{\partial L}{\partial Y} \ast k_{rot}
$$ where $k_{rot}(i, j) = k(-i, -j)$
For kernel,
$$
\frac{\partial L}{\partial k} = \frac{\partial L}{\partial Y} \ast X_{pad}
$$ where $X_{pad}$ is $X$ with appropriate padding.
Activation functions
ReLU
$$
ReLU(x) = \max(0, x)
$$
Leaky ReLU
$$
LeakyReLU(x) = \begin{cases}
x, & x > 0 \newline
\alpha x, & x \le 0
\end{cases}
$$ where $\alpha$ is a small constant, often set to $0.01$
Sigmoid
$$
f(x) = \frac{1}{1 + e^{-x}}
$$
Tanh
$$
\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}
$$
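All four activations as NumPy one-liners (a minimal sketch; `np.tanh` is used directly for Tanh):

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

x = np.linspace(-3, 3, 7)
print(relu(x), leaky_relu(x), sigmoid(x), np.tanh(x))
```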
Loss functions
MSE
$$
MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2
$$
Cross-Entropy Loss
$$
CE = -\frac{1}{n}\sum_{i=1}^n[y_i\log(\hat{y}_i) + (1-y_i)\log(1-\hat{y}_i)]
$$
Hinge Loss (For SVM)
$$
HingeLoss = \sum_{i=1}^n\max(0, 1-y_i \cdot \hat{y}_i), \quad y_i \in \{+1, -1\}
$$
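The three losses above in NumPy (a sketch; $\hat{y}$ is assumed to be a probability in $(0, 1)$ for the cross-entropy and a raw decision score for the hinge loss):

```python
import numpy as np

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)

def binary_cross_entropy(y, y_hat, eps=1e-12):
    y_hat = np.clip(y_hat, eps, 1 - eps)   # avoid log(0)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def hinge_loss(y, y_hat):
    # labels y in {+1, -1}; y_hat is the raw score, not a probability
    return np.sum(np.maximum(0, 1 - y * y_hat))
```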
KL divergence
$$
D_{KL}(P || Q) = -\mathbb{E}_{P(X)}[\log \frac{Q(X)}{P(X)}] \ge 0
$$
Its non-negativity can be proved via Gibbs' inequality; in fact, the statement of Gibbs' inequality is exactly this non-negativity, so the two proofs coincide.
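For two discrete distributions given as probability vectors, the definition can be evaluated directly (a sketch; both vectors are assumed to be strictly positive and to sum to 1):

```python
import numpy as np

def kl_divergence(p, q):
    # D_KL(P || Q) = sum_x P(x) log(P(x) / Q(x)) >= 0
    return np.sum(p * np.log(p / q))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])
print(kl_divergence(p, q))   # non-negative, 0 only when p == q
```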
SVM
Linear
Hard margin
$$
\min_{\mathbf{w}, b} \left\| \mathbf{w} \right\|_2^2 \newline
\text{s.t.}\ y_i(\mathbf{w}^\top x_i-b) \ge 1, \quad \forall i
$$
Soft margin (for data that cannot be linearly separated)
With slack variables $\zeta_i$, which coincide with the per-sample hinge loss at the optimum,
$$
\min_{\mathbf{w}, b} \left\| \mathbf{w} \right\|_2^2 + C\sum_{i=1}^{n}\zeta_i \newline
\text{s.t.}\ y_i(\mathbf{w}^\top x_i-b) \ge 1 - \zeta_i, \quad \zeta_i \ge 0
$$
Non-Linear (Kernel function trick)
With a kernel function, the SVM decision function becomes
$$
f(x) = \operatorname{sign}\left(\sum_{i=1}^n\alpha_iy_iK(x_i, \mathbf{x}) + b\right)
$$ where the $\alpha_i$ are Lagrange multipliers; $\alpha$ and $b$ are obtained by solving the dual problem
Polynomial(homogeneous)
$$
K(x_i, x_j) = (x_i \cdot x_j)^d
$$
Polynomial(inhomogeneous)
$$
K(x_i, x_j) = (x_i \cdot x_j + r)^d
$$
Gaussian radial basis function
$$
K(x_i, x_j) = \exp(-\gamma \left\| x_i - x_j \right\|^2)
$$
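A sketch of the kernels and the resulting decision function (the support vectors, $\alpha_i$, and $b$ are assumed to come from solving the dual problem, which is not shown; all hyperparameter values are illustrative):

```python
import numpy as np

def poly_kernel(xi, xj, d=3, r=0.0):
    return (np.dot(xi, xj) + r) ** d        # r = 0 gives the homogeneous case

def rbf_kernel(xi, xj, gamma=0.5):
    return np.exp(-gamma * np.sum((xi - xj) ** 2))

def svm_decision(x, support_x, support_y, alpha, b, kernel=rbf_kernel):
    # f(x) = sign(sum_i alpha_i y_i K(x_i, x) + b)
    s = sum(a * y * kernel(xi, x) for a, y, xi in zip(alpha, support_y, support_x))
    return np.sign(s + b)
```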
Self-Attention
$$
Attention(Q, K, V) = Softmax(\frac{QK^T}{\sqrt{d_k}}) V
$$
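A direct NumPy sketch of the formula ($Q$, $K$, $V$ are assumed to already be projected, with shapes $(n, d_k)$, $(m, d_k)$, $(m, d_v)$; the max-shift inside the softmax is only for numerical stability):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                        # (n, m)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)         # row-wise softmax
    return weights @ V                                     # (n, d_v)

Q = np.random.randn(4, 8); K = np.random.randn(6, 8); V = np.random.randn(6, 16)
print(scaled_dot_product_attention(Q, K, V).shape)         # (4, 16)
```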
Multihead Attention
$$
\begin{aligned}
MultiHead(Q, K, V) &= concat(head_1, \ldots, head_h)W^O \newline
\text{where } head_i &= Attention(QW_i^Q, KW_i^K, VW_i^V)
\end{aligned}
$$
Position Embedding
$$
PE(pos, 2i) = \sin(pos/10000^{2i/d_{model}}) \newline
PE(pos, 2i+1) = \cos(pos/10000^{2i/d_{model}})
$$
and in practice the frequency term $1/10000^{2i/d_{model}}$ is computed in log space as
$$
\exp\left(-\frac{2i}{d_{model}}\log{10000}\right) = (1/10000)^{2i/d_{model}}
$$
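A sketch that builds the full encoding table this way (`max_len` is illustrative; $d_{model}$ is assumed to be even):

```python
import numpy as np

def sinusoidal_pe(max_len, d_model):
    pos = np.arange(max_len)[:, None]              # (max_len, 1)
    two_i = np.arange(0, d_model, 2)[None, :]      # even dimensions 2i
    # 1 / 10000^{2i/d_model}, computed as exp(-(2i/d_model) * log 10000)
    inv_freq = np.exp(-(two_i / d_model) * np.log(10000.0))
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(pos * inv_freq)
    pe[:, 1::2] = np.cos(pos * inv_freq)
    return pe

print(sinusoidal_pe(50, 64).shape)   # (50, 64)
```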
Position-wise Feed-Forward Networks
This is the fully connected layer used in the Transformer:
$$
FFN(x) = \max(0, xW_1 + b_1)W_2 + b_2
$$
and it can also be described as two convolutions with kernel size 1.
GAN
$$
\min_G\max_D V(D, G) = \mathbb{E}_{\mathbf{x} \sim P_{data}(\mathbf{x})}[\log D(\mathbf{x})] + \mathbb{E}_{\mathbf{z} \sim P_{z}(\mathbf{z})}[\log (1 - D(G(\mathbf{z})))]
$$
Note that here we use cross-entropy loss.
See more at https://jaketae.github.io/study/gan-math/
Diffusion Model
See more at https://www.assemblyai.com/blog/diffusion-models-for-machine-learning-introduction/
Forward Process
$$
q(x_t|x_{t-1}) = \mathcal{N}(x_t;\sqrt{1-\beta_t}x_{t-1}, \beta_t\mathbf{I})
$$
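A single noising step can be sampled with the reparameterization $x_t = \sqrt{1-\beta_t}\,x_{t-1} + \sqrt{\beta_t}\,\epsilon$, $\epsilon \sim \mathcal{N}(0, \mathbf{I})$. A sketch (the $\beta_t$ value is a placeholder, not a real noise schedule):

```python
import numpy as np

def forward_diffusion_step(x_prev, beta_t):
    # sample x_t ~ N(sqrt(1 - beta_t) * x_{t-1}, beta_t * I)
    noise = np.random.randn(*x_prev.shape)
    return np.sqrt(1.0 - beta_t) * x_prev + np.sqrt(beta_t) * noise

x0 = np.random.randn(16)
x1 = forward_diffusion_step(x0, beta_t=0.02)
```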
Reverse Process
$$
p_\theta(x_{t-1}|x_t) = \mathcal{N}(x_{t-1};\mu_\theta(x_t, t), \Sigma_\theta(x_t, t))
$$
Loss (VLB)
$$
L = \mathbb{E}_{q(x_{0:T})}\left[\log{\frac{q(x_{1:T}|x_0)}{p_{\theta}(x_{0:T})}}\right]
$$