[Paper Notes] EmoGen: Emotional Image Content Generation with Text-to-Image Diffusion Models

Wednesday, October 2, 2024

Paper Notes

Artificial Intelligence, Deep Learning

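⚠️ This is an original article by P3troL1er, first published at https://peterliuzhi.top/posts/%E8%AE%BA%E6%96%87%E7%AC%94%E8%AE%B0/%E8%AE%BA%E6%96%87%E7%AC%94%E8%AE%B0emogen-emotional-image-content-generation-with-text-to-image-diffusion-models/. For commercial reposting, please contact the author for permission; for non-commercial reposting, please credit the source.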

Abstract

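This paper is about image generation.
It proposes Emotional Image Content Generation (EICG), a task of generating semantic-clear and emotion-faithful images.
- It introduces an emotion space.
- It uses a mapping network to align the emotion space with the CLIP space.
- It uses an attribute loss and emotion confidence.
It evaluates with three metrics: emotion accuracy, semantic clarity, and semantic diversity.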

Motivation

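Emotion matters in computer vision and has many related applications.
Diffusion models provide a powerful tool for text-to-image generation, but they struggle with abstract concepts such as emotions.
Previous approaches cannot generate emotional images accurately and distinctly.
Emotion cannot be conveyed through color and style alone.
EmoSet provides the emotion-related dataset support.
Hence the paper proposes EICG.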

Method

Emotion space

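Points of similar emotions cluster together, while points of dissimilar emotions lie far apart.
A ResNet-50 is used as the network that captures emotion representations, trained with supervision on EmoSet.
Cross-entropy is used as the loss function: $$\mathcal{L}_{\text{emo}} = - \sum_{i=1}^{C} y_{\text{emo}} \log \left( \frac{\exp(\varphi(x, i))}{\sum_{k=1}^{C} \exp(\varphi(x, k))} \right)$$
At inference time, the representation of each emotion cluster is randomly sampled from the corresponding Gaussian distribution, which ensures both validity and diversity.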
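The two ideas above, softmax cross-entropy for training and Gaussian sampling at inference, can be sketched in a few lines. This is only a minimal illustration using the Python standard library: the logits, the cluster mean and standard deviation, and the choice of a diagonal Gaussian are all assumptions made for the example, not values or details taken from the paper.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of class logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def emotion_cross_entropy(logits, label):
    """Cross-entropy for one image: negative log-probability of the true class."""
    return -math.log(softmax(logits)[label])

def sample_emotion_vector(mean, std, rng):
    """Draw one point from a diagonal Gaussian fitted to an emotion cluster.
    The diagonal covariance is an assumption of this sketch."""
    return [rng.gauss(mu, sigma) for mu, sigma in zip(mean, std)]

# Hypothetical logits phi(x, i) for C = 3 emotion classes; class 0 is the label.
loss_confident = emotion_cross_entropy([4.0, 0.5, 0.1], label=0)
loss_uncertain = emotion_cross_entropy([0.3, 0.2, 0.1], label=0)

# Hypothetical statistics of one emotion cluster in a 3-dimensional emotion space.
rng = random.Random(0)
cluster_mean, cluster_std = [1.0, -2.0, 0.5], [0.1, 0.1, 0.1]
samples = [sample_emotion_vector(cluster_mean, cluster_std, rng) for _ in range(3)]
```

A confident, correct prediction gives a smaller loss than a nearly uniform one, and repeated sampling yields different vectors that all stay near the cluster mean, which is the validity-plus-diversity trade-off the notes describe.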

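The idea of the emotion space is somewhat like embedding vectors in NLP. Once it has been pretrained on its own, could other models use it in a plug-and-play fashion?

What if an input contains several complex emotions at once? The current ResNet-based solution looks like a classification approach; for complex emotions, could the top-k classes be selected?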

Mapping Network

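A mapping network converts the emotion space into the CLIP space.
Points that are close in the emotion space may be scattered in the CLIP space, so a linear transformation is not enough; an MLP is used for a non-linear transformation instead.
The output then passes through the CLIP transformer.
Finally, a fully connected layer projects it into the CLIP space.
To make better use of the knowledge in the CLIP space, the parameters of the latter two are frozen.
These three steps together are called the mapping network; only the first part, the MLP, actually needs training.
The mapped result is fed into the U-Net of the diffusion model for the downstream task.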
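The data flow of the three steps can be sketched as follows. This is a toy illustration only: the class name `MappingNetwork`, the dimensions, and the random weights are hypothetical, and the frozen CLIP transformer and projection are replaced by single linear layers as stand-ins, so the sketch shows which parts are trainable rather than how CLIP works.

```python
import random

def linear(layer, v):
    """Apply y = W v + b for a layer stored as {"weights": W, "bias": b}."""
    return [sum(w * x for w, x in zip(row, v)) + b
            for row, b in zip(layer["weights"], layer["bias"])]

def relu(v):
    return [max(0.0, x) for x in v]

def random_layer(n_in, n_out, rng):
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return {"weights": weights, "bias": [0.0] * n_out}

class MappingNetwork:
    def __init__(self, emo_dim, hidden_dim, clip_dim, rng):
        # Step 1: the non-linear MLP, the only part that would be trained.
        self.mlp = [random_layer(emo_dim, hidden_dim, rng),
                    random_layer(hidden_dim, clip_dim, rng)]
        # Steps 2 and 3: frozen. Single linear layers are toy stand-ins for
        # the CLIP transformer and the final fully connected projection.
        self.frozen_transformer = random_layer(clip_dim, clip_dim, rng)
        self.frozen_projection = random_layer(clip_dim, clip_dim, rng)

    def trainable_layers(self):
        """Only the MLP layers would receive gradient updates."""
        return list(self.mlp)

    def forward(self, emotion_vector):
        hidden = relu(linear(self.mlp[0], emotion_vector))
        mapped = linear(self.mlp[1], hidden)
        encoded = linear(self.frozen_transformer, mapped)
        return linear(self.frozen_projection, encoded)

rng = random.Random(0)
net = MappingNetwork(emo_dim=4, hidden_dim=8, clip_dim=6, rng=rng)
condition = net.forward([0.2, -0.1, 0.7, 0.0])  # would condition the U-Net
```

Keeping the frozen parts out of `trainable_layers()` mirrors the design choice in the notes: the pretrained CLIP components keep their knowledge, and only the MLP learns to place emotion vectors where CLIP expects them.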

Loss

$$ \mathcal{L}_{\text{LDM}} = \mathbb{E}_{z,x,\epsilon,t} \left[ \left\| \epsilon - \epsilon_{\theta} \left( z_t, t, \tau_{\theta} \left( F \left( \varphi(x) \right) \right) \right) \right\|_2^2 \right] $$

$\epsilon$ is the noise, $\epsilon_{\theta}$ is the denoising network, and $z_t$ is the noisy latent at timestep $t$.
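To make the objective concrete, here is a minimal single-sample sketch of the noise-prediction loss. It uses the standard forward-diffusion form $z_t = \sqrt{\bar\alpha_t}\, z_0 + \sqrt{1-\bar\alpha_t}\, \epsilon$; the value of `alpha_bar_t`, the 4-dimensional latent, and the `oracle_denoiser` (which cheats by looking at the clean latent) are assumptions for illustration. The emotion condition is omitted to keep the example short.

```python
import math
import random

def add_noise(z0, noise, alpha_bar_t):
    """Forward diffusion: z_t = sqrt(alpha_bar_t) * z0 + sqrt(1 - alpha_bar_t) * noise."""
    a, b = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    return [a * z + b * e for z, e in zip(z0, noise)]

def ldm_loss(noise, predicted_noise):
    """Squared L2 distance between the true and the predicted noise (one sample)."""
    return sum((e - p) ** 2 for e, p in zip(noise, predicted_noise))

def oracle_denoiser(z_t, z0, alpha_bar_t):
    """Toy stand-in for the denoising network: recovers the noise exactly
    because it is given the clean latent, which a real model never sees."""
    a, b = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    return [(zt - a * z) / b for zt, z in zip(z_t, z0)]

rng = random.Random(0)
z0 = [rng.gauss(0.0, 1.0) for _ in range(4)]      # hypothetical clean latent
noise = [rng.gauss(0.0, 1.0) for _ in range(4)]   # epsilon
alpha_bar_t = 0.5                                  # hypothetical schedule value

z_t = add_noise(z0, noise, alpha_bar_t)
loss_oracle = ldm_loss(noise, oracle_denoiser(z_t, z0, alpha_bar_t))
loss_zero = ldm_loss(noise, [0.0] * 4)             # a model that predicts no noise
```

A perfect noise prediction drives the loss to (numerically) zero, while predicting zeros leaves the full squared norm of the noise as the loss.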

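However, the LDM loss alone is not enough: the same emotion can correspond to diverse semantics, and training with only the LDM loss makes an abstract emotion collapse into one concrete object, losing diversity.

Therefore, building on EmoSet, the paper proposes $\mathcal{L}_{\text{attr}}$.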

$$ \mathcal{L}_{\text{attr}} = - \sum_{j=1}^{C} y_{\text{attr}} \log \left( \frac{\exp(f(v_{\text{emo}}, \tau_\theta(a_j)))}{\sum_{k=1}^{C} \exp(f(v_{\text{emo}}, \tau_\theta(a_k)))} \right) $$

where $f$ is the cosine similarity.
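The attribute loss is a softmax cross-entropy whose logits are cosine similarities. A minimal sketch, assuming $v_{\text{emo}}$ is the mapped emotion embedding and each attribute embedding stands for $\tau_\theta(a_j)$; the 2-dimensional vectors are made up for the example, and no temperature scaling is applied because the formula as written has none.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def attribute_loss(v_emo, attribute_embeddings, target_index):
    """Cross-entropy over attributes, with cosine similarities as logits."""
    sims = [cosine_similarity(v_emo, a) for a in attribute_embeddings]
    m = max(sims)
    log_norm = m + math.log(sum(math.exp(s - m) for s in sims))  # log-sum-exp
    return -(sims[target_index] - log_norm)

# Hypothetical embeddings: v_emo points almost exactly at attribute 0.
v_emo = [1.0, 0.0]
attributes = [[1.0, 0.1], [0.0, 1.0], [-1.0, 0.0]]

loss_matching = attribute_loss(v_emo, attributes, target_index=0)
loss_opposite = attribute_loss(v_emo, attributes, target_index=2)
```

The loss is small when the emotion embedding aligns with the labeled attribute and large when it points away from it, which is what pulls each emotion toward concrete, attribute-level semantics.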

Confidence

Not every image carries emotion, so the ratio between the LDM loss and the attribute loss can be adjusted dynamically.

$$ \mathcal{L}_{\text{stage-2}} = \left( 1 - \alpha_{ij} \right) \mathcal{L}_{\text{LDM}} + \alpha_{ij} \mathcal{L}_{\text{attr}} $$

where $\alpha_{ij}$ is the emotion confidence:

$$ \alpha_{ij} = \frac{1}{N_j}\sum_{n=1}^{N_j} p(x_n, i) $$

where $p(\cdot)$ is the emotion vector in the emotion space, $x_n$ is the input image, $i$ is the $i$-th emotion, and $N_j$ is the number of images belonging to attribute $j$.
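A small worked example of the weighting, assuming that $p(x_n, i)$ can be read as the $i$-th entry of a per-image probability vector over emotions. The probabilities and loss values below are invented for illustration.

```python
def emotion_confidence(probabilities, emotion_index):
    """alpha_ij: mean of p(x_n, i) over the N_j images that carry attribute j."""
    return sum(p[emotion_index] for p in probabilities) / len(probabilities)

def stage2_loss(ldm, attr, alpha):
    """Blend of the two losses, weighted by the emotion confidence alpha."""
    return (1.0 - alpha) * ldm + alpha * attr

# Hypothetical emotion probabilities of N_j = 3 images sharing one attribute.
probs_for_attribute = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3]]

alpha = emotion_confidence(probs_for_attribute, emotion_index=0)  # about 0.8
total = stage2_loss(ldm=2.0, attr=1.0, alpha=alpha)               # about 1.2
```

When an attribute is strongly tied to an emotion, `alpha` is high and the attribute loss dominates; when the tie is weak, training falls back mostly on the LDM loss.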
