Table of Contents

- Probabilistic Graphical Models
  - Directed vs Undirected
- Probabilistic Models
  - Parameter Estimation
  - Latent-Variable Estimation
  - Variational Inference
- Application Example 1: BM and RBM
- Application Example 2: Network Structure Estimation in Probabilistic Graphical Models

Probabilistic Graphical Models
Directed vs Undirected
A probabilistic graphical model uses a graph to describe the dependence structure among a set of random variables. A model described by a directed (acyclic) graph is called a Bayesian network; a model described by an undirected graph is called a Markov network.
The intuitive difference between directed and undirected models is "causality"; the essential difference is that the two model different families of independence relations. For example, from the perspective of independence maps: directed models cannot represent undirected cyclic relations, and undirected models cannot represent the directed V-structure.
Undirected graphical model
$$\Pr(s) = \frac{1}{Z} \prod_{C} \phi_C(s_C)$$
Directed graphical model
$$\Pr(s) = \prod_i \Pr(s_i \vert \mathrm{parent}(s_i))$$
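As a concrete sanity check, a directed chain can also be written in the undirected form by using the CPTs as clique potentials; a minimal sketch (the CPT numbers below are made up for illustration):

```python
import itertools
import numpy as np

# Chain X0 - X1 - X2 over binary variables, written both ways.
# Directed: Pr(s) = Pr(s0) Pr(s1 | s0) Pr(s2 | s1), with made-up CPTs.
p0 = np.array([0.6, 0.4])                    # Pr(s0)
p1 = np.array([[0.9, 0.1], [0.2, 0.8]])      # Pr(s1 | s0)
p2 = np.array([[0.7, 0.3], [0.4, 0.6]])      # Pr(s2 | s1)

def pr_directed(s):
    return p0[s[0]] * p1[s[0], s[1]] * p2[s[1], s[2]]

# Undirected: Pr(s) = (1/Z) phi01(s0, s1) phi12(s1, s2), with the CPTs
# reused as clique potentials; Z then comes out to exactly 1.
def pr_undirected_unnorm(s):
    return (p0[s[0]] * p1[s[0], s[1]]) * p2[s[1], s[2]]

states = list(itertools.product([0, 1], repeat=3))
Z = sum(pr_undirected_unnorm(s) for s in states)
for s in states:
    assert abs(pr_directed(s) - pr_undirected_unnorm(s) / Z) < 1e-12
print(Z)  # 1.0
```

The converse does not hold in general: as noted above, a V-structure has no undirected independence map, and an undirected cycle has no directed one.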
Probabilistic Models
Parameter Estimation
Parameter distribution
$$\Pr(\theta \vert S) \propto \Pr(\theta) \Pr(S \vert \theta)$$
Maximum a posteriori (MAP)
$$\arg \max_{\theta} \Pr(\theta \vert S) = \arg \max_{\theta} \Pr(\theta) \Pr(S \vert \theta)$$
Maximum likelihood (MAP under the additional assumption that the prior over the parameters is uniform)
$$\arg \max_{\theta} \Pr(\theta \vert S) = \arg \max_{\theta} \Pr(S \vert \theta)$$
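The MLE/MAP distinction can be sketched in a few lines for a Bernoulli model with a Beta prior (the prior hyperparameters and counts below are made up for illustration):

```python
# Coin flips: 7 heads out of 10 tosses.
heads, n = 7, 10

# MLE: argmax_theta Pr(S | theta) for a Bernoulli likelihood.
theta_mle = heads / n  # 0.7

# MAP with a Beta(a, b) prior: argmax_theta Pr(theta) Pr(S | theta).
# The posterior is Beta(a + heads, b + n - heads); its mode is the MAP estimate.
a, b = 2.0, 2.0
theta_map = (a + heads - 1) / (a + b + n - 2)  # 8/12 = 0.666...

print(theta_mle, theta_map)
```

With a uniform prior (a = b = 1) the MAP estimate reduces to the MLE, which is exactly the statement above.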
Latent-Variable Estimation
Maximum likelihood
$$\arg \max_{\theta} \Pr(\theta \vert S) = \arg \max_{\theta} \Pr(S \vert \theta) = \arg \max_{\theta} \sum_{Z} \Pr(S,Z \vert \theta)$$
Likelihood lower bound
From the weighted arithmetic-geometric mean (AM-GM) inequality:
$$\sum_{Z} \Pr(S,Z \vert \theta) = \sum_{Z} \Pr(Z \vert \eta) \frac{\Pr(S,Z \vert \theta)}{\Pr(Z \vert \eta)} \geqslant \prod_{Z} \left( \frac{\Pr(S,Z \vert \theta)}{\Pr(Z \vert \eta)} \right)^{\Pr(Z \vert \eta)}$$
From Jensen's inequality:
$$\log \sum_{Z} \Pr(S,Z \vert \theta) = \log \sum_{Z} \Pr(Z \vert \eta) \frac{\Pr(S,Z \vert \theta)}{\Pr(Z \vert \eta)} \geqslant \sum_{Z} \Pr(Z \vert \eta) \log \frac{\Pr(S,Z \vert \theta)}{\Pr(Z \vert \eta)}$$
$$\begin{aligned} \sum_{Z} \Pr(Z \vert \eta) \log \frac{\Pr(S,Z \vert \theta)}{\Pr(Z \vert \eta)} &= \sum_{Z} \Pr(Z \vert \eta) \log \Pr(S,Z \vert \theta) - \sum_{Z} \Pr(Z \vert \eta) \log \Pr(Z \vert \eta) \\ &= \sum_{Z} \Pr(Z \vert \eta) \log \Pr(S,Z \vert \theta) + \mathrm{Entropy}(\Pr(Z \vert \eta)) \end{aligned}$$
Gap of the lower bound
$$\begin{aligned} \sum_{Z} \Pr(Z \vert \eta) \log \frac{\Pr(S,Z \vert \theta)}{\Pr(Z \vert \eta)} &= \sum_{Z} \Pr(Z \vert \eta) \log \frac{\Pr(Z \vert S,\theta) \Pr(S \vert \theta)}{\Pr(Z \vert \eta)} \\ &= \log \Pr(S \vert \theta) - \mathrm{KL}(\Pr(Z \vert \eta) \Vert \Pr(Z \vert S,\theta)) \end{aligned}$$
Identity
$$\log \Pr(S \vert \theta) = \sum_{Z} \Pr(Z \vert \eta) \log \Pr(S,Z \vert \theta) + \mathrm{Entropy}(\Pr(Z \vert \eta)) + \mathrm{KL}(\Pr(Z \vert \eta) \Vert \Pr(Z \vert S,\theta))$$
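The identity log Pr(S|θ) = Σ_Z Pr(Z|η) log Pr(S,Z|θ) + Entropy + KL(Pr(Z|η) ‖ Pr(Z|S,θ)) holds exactly for any distribution over Z, which is easy to check numerically on a tiny discrete model (the joint values below are made up):

```python
import numpy as np

# Fixed observation S; the latent Z ranges over three states.
p_joint = np.array([0.10, 0.25, 0.15])   # Pr(S, Z=z | theta) for z = 0, 1, 2
p_S = p_joint.sum()                      # Pr(S | theta) = 0.5
posterior = p_joint / p_S                # Pr(Z | S, theta)
q = np.array([0.30, 0.50, 0.20])         # Pr(Z | eta): an arbitrary distribution

lhs = np.log(p_S)
rhs = ((q * np.log(p_joint)).sum()       # expected complete-data log-likelihood
       - (q * np.log(q)).sum()           # Entropy(q)
       + (q * np.log(q / posterior)).sum())  # KL(q || posterior)
print(abs(lhs - rhs))  # ~0 regardless of the choice of q
```

Changing `q` moves mass between the entropy and KL terms, but the sum stays pinned at log Pr(S|θ); EM exploits exactly this decomposition.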
EM Algorithm
$$\theta^{t+1} = \arg \max_{\theta} \sum_{Z} \Pr(Z \vert S,\theta^t) \log \Pr(S,Z \vert \theta)$$
E step: fix the parameters $\theta$ and optimize the latent-distribution parameters $\eta$. The problem is unconstrained, and the optimum closes the lower-bound gap $\mathrm{KL}(\Pr(Z \vert \eta) \Vert \Pr(Z \vert S,\theta))$.
M step: fix the latent-distribution parameters $\eta$ and optimize the parameters $\theta$, discarding the constant $\mathrm{Entropy}(\Pr(Z \vert \eta))$ and the gap term $\mathrm{KL}(\Pr(Z \vert \eta) \Vert \Pr(Z \vert S,\theta))$.
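The two steps can be sketched on a classic toy problem: a mixture of two biased coins, where the coin choice is the latent Z. All numbers below (biases, counts, initialization) are made up for illustration, and uniform mixing weights are assumed:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: each row is 20 flips of one of two biased coins (latent choice Z).
true_p = [0.8, 0.3]
z = rng.integers(0, 2, size=200)
flips = rng.binomial(20, np.take(true_p, z))

p = np.array([0.6, 0.5])  # initial guess for the two coin biases
for _ in range(50):
    # E step: responsibilities Pr(Z=c | data, p), computed in log space.
    logw = flips[:, None] * np.log(p) + (20 - flips[:, None]) * np.log(1 - p)
    w = np.exp(logw - logw.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    # M step: maximize the expected complete-data log-likelihood
    # -> responsibility-weighted coin-bias estimates.
    p = (w * flips[:, None]).sum(axis=0) / (w * 20).sum(axis=0)

print(np.sort(p))  # close to the true biases [0.3, 0.8]
```

Each E step sets the responsibilities to the exact posterior (gap zero), and each M step has a closed form, so the likelihood increases monotonically.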
Variational Inference
$$\arg \min_{\zeta} \mathrm{KL}(\Pr(Y \vert \zeta) \Vert \Pr(Y \vert X))$$
Approximate $\Pr(Y \vert X)$ with $\Pr(Y \vert \zeta)$. Gradient-based and similar methods can be used.
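A minimal sketch of this gradient-based view: fit a factorized (mean-field) Gaussian to a correlated 2D Gaussian "posterior" by descending the KL, which has closed-form gradients in this case. The target mean and covariance are made up for illustration:

```python
import numpy as np

# Target "posterior": a correlated 2D Gaussian N(mu, Sigma).
mu = np.array([1.0, -1.0])
Sigma = np.array([[1.0, 0.8], [0.8, 1.0]])
P = np.linalg.inv(Sigma)  # precision matrix

# Variational family: factorized Gaussian q with parameters zeta = (m, s).
m = np.zeros(2)
s = np.ones(2)

# Gradient descent on KL(q || p); for Gaussians the gradients are closed-form:
# dKL/dm = P (m - mu),  dKL/ds_i = P_ii s_i - 1/s_i.
for _ in range(2000):
    m -= 0.05 * (P @ (m - mu))
    s -= 0.05 * (np.diag(P) * s - 1.0 / s)

print(m)     # -> mu
print(s**2)  # -> 1 / diag(P), ~0.36 here, NOT diag(Sigma) = 1
```

The result illustrates a well-known property of this KL direction: the optimal factorized q matches the mean but underestimates the marginal variances (it matches the diagonal of the precision, not of the covariance).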
Application Example 1: BM and RBM
BM (Boltzmann machine)
A BM is a binary-valued neural network in the energy-minimization framework.
$$\mathrm{Eng}(s) = \sum_{i < j} \mathrm{Eng}(e_{ij}) + \sum_{i} \mathrm{Eng}(v_i) = - \sum_{i < j} \alpha_{ij} s_i s_j - \sum_{i} \beta_i s_i$$
In essence, a BM is an undirected graphical model with latent variables (hidden units). Each edge and each vertex forms its own clique, contributing one factor.
$$\Pr(s) = \frac{1}{Z} \prod_{i < j} e^{\alpha_{ij} s_i s_j} \prod_{i} e^{\beta_i s_i}$$
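To make the normalized-product form concrete, a tiny BM can be normalized by brute-force enumeration of all $2^n$ states (the weights below are made up; this is exactly the exponential cost that motivates the approximations later in this section):

```python
import itertools
import numpy as np

# A tiny 3-unit Boltzmann machine with made-up weights (s_i in {0, 1}).
n = 3
alpha = np.array([[0.0, 1.0, -0.5],
                  [0.0, 0.0, 0.7],
                  [0.0, 0.0, 0.0]])  # upper triangle holds alpha_ij for i < j
beta = np.array([0.2, -0.3, 0.1])

states = np.array(list(itertools.product([0, 1], repeat=n)))
# Eng(s) = -sum_{i<j} alpha_ij s_i s_j - sum_i beta_i s_i, for every state at once.
energy = -np.einsum('ki,ij,kj->k', states, alpha, states) - states @ beta
p = np.exp(-energy)
p /= p.sum()  # Z: the sum over all 2^n configurations

print(p.sum())  # 1.0
```

Since `alpha` is kept upper-triangular, the `einsum` contraction reproduces the $\sum_{i<j}$ in the energy exactly once per pair.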
RBM (restricted Boltzmann machine)
$$\mathrm{Eng}(h,v) = - \sum_{i}^{n} \sum_{j}^{m} \alpha_{ij} h_i v_j - \sum_{i}^{n} \beta_i h_i - \sum_{j}^{m} \gamma_j v_j = - h^T A v - h^T \beta - v^T \gamma$$

where $A = (\alpha_{ij})$, $\beta = (\beta_i)$, $\gamma = (\gamma_j)$.
$$\Pr(h,v) = \frac{\displaystyle e^{h^T A v + h^T \beta + v^T \gamma}}{\displaystyle \iint e^{h^T A v + h^T \beta + v^T \gamma} \,\mathrm{d} h \,\mathrm{d} v}$$
Maximum likelihood
$$\arg \max_{\theta} \log \prod_{k}^{K} \Pr(v^{(k)})$$
$$\begin{aligned} \log \prod_{k}^{K} \Pr(v^{(k)}) &= \sum_{k}^{K} \log \Pr(v^{(k)}) \\ &= \sum_{k}^{K} \log \int \Pr(v^{(k)},h) \,\mathrm{d} h \\ &= \sum_{k}^{K} \log \frac{\displaystyle \int e^{h^T A v^{(k)} + h^T \beta + {v^{(k)}}^T \gamma} \,\mathrm{d} h}{\displaystyle \iint e^{h^T A v + h^T \beta + v^T \gamma} \,\mathrm{d} h \,\mathrm{d} v} \\ &= \sum_{k}^{K} \left( \log \int e^{h^T A v^{(k)} + h^T \beta + {v^{(k)}}^T \gamma} \,\mathrm{d} h \right) - K \log \iint e^{h^T A v + h^T \beta + v^T \gamma} \,\mathrm{d} h \,\mathrm{d} v \end{aligned}$$
$$\begin{aligned} & \frac{\mathrm{d}}{\mathrm{d} \theta} \log \prod_{k}^{K} \Pr(v^{(k)}) \\ =& \frac{\mathrm{d}}{\mathrm{d} \theta} \left[ \sum_{k}^{K} \left( \log \int e^{h^T A v^{(k)} + h^T \beta + {v^{(k)}}^T \gamma} \,\mathrm{d} h \right) - K \log \iint e^{h^T A v + h^T \beta + v^T \gamma} \,\mathrm{d} h \,\mathrm{d} v \right] \\ =& \sum_{k}^{K} \frac{\displaystyle \int e^{h^T A v^{(k)} + h^T \beta + {v^{(k)}}^T \gamma} \, \frac{\mathrm{d}}{\mathrm{d} \theta} \left( h^T A v^{(k)} + h^T \beta + {v^{(k)}}^T \gamma \right) \mathrm{d} h}{\displaystyle \int e^{h^T A v^{(k)} + h^T \beta + {v^{(k)}}^T \gamma} \,\mathrm{d} h} - K \frac{\displaystyle \iint e^{h^T A v + h^T \beta + v^T \gamma} \, \frac{\mathrm{d}}{\mathrm{d} \theta} \left( h^T A v + h^T \beta + v^T \gamma \right) \mathrm{d} h \,\mathrm{d} v}{\displaystyle \iint e^{h^T A v + h^T \beta + v^T \gamma} \,\mathrm{d} h \,\mathrm{d} v} \\ =& \sum_{k}^{K} \int \Pr(h \vert v^{(k)}) \, \frac{\mathrm{d}}{\mathrm{d} \theta} \left( h^T A v^{(k)} + h^T \beta + {v^{(k)}}^T \gamma \right) \mathrm{d} h - K \iint \Pr(h,v) \, \frac{\mathrm{d}}{\mathrm{d} \theta} \left( h^T A v + h^T \beta + v^T \gamma \right) \mathrm{d} h \,\mathrm{d} v \\ =& \sum_{k}^{K} \mathop{\mathbb{E}}\limits_{h \sim \Pr(h \vert v^{(k)})} \left[ h {v^{(k)}}^T \,\mathrm{d} A + h \,\mathrm{d} \beta + v^{(k)} \,\mathrm{d} \gamma \right] - K \mathop{\mathbb{E}}\limits_{h,v \sim \Pr(h,v)} \left[ h v^T \,\mathrm{d} A + h \,\mathrm{d} \beta + v \,\mathrm{d} \gamma \right] \end{aligned}$$
Contrastive Divergence (CD)
Consider the differential of the likelihood of the k-th sample.
The cost of computing the expectations by integration is unacceptable. (For example, for the original binary network, the sum has $2^{n+m}$ terms.)
From the integration viewpoint, the integral normalizes and marginalizes a probability; simplify it via a mean-value argument. From the expectation viewpoint, the expectation is a representative value of a random variable; simplify it by letting samples act as the representatives.
$$\begin{aligned} & \int \Pr(h \vert v) \, \frac{\mathrm{d}}{\mathrm{d} \theta} \left( h^T A v + h^T \beta + v^T \gamma \right) \mathrm{d} h - \iint \Pr(h,v) \, \frac{\mathrm{d}}{\mathrm{d} \theta} \left( h^T A v + h^T \beta + v^T \gamma \right) \mathrm{d} h \,\mathrm{d} v \\ =& \mathop{\mathbb{E}}\limits_{h \sim \Pr(h \vert v)} \left[ h v^T \,\mathrm{d} A + h \,\mathrm{d} \beta + v \,\mathrm{d} \gamma \right] - \mathop{\mathbb{E}}\limits_{h,v \sim \Pr(h,v)} \left[ h v^T \,\mathrm{d} A + h \,\mathrm{d} \beta + v \,\mathrm{d} \gamma \right] \\ \approx& \left( h v^T \,\mathrm{d} A + h \,\mathrm{d} \beta + v \,\mathrm{d} \gamma \right) - \left( h' v'^T \,\mathrm{d} A + h' \,\mathrm{d} \beta + v' \,\mathrm{d} \gamma \right) \end{aligned}$$
The expression above draws a single sample; it generalizes to multiple samples. The samples are drawn by Gibbs sampling.
$$\begin{aligned} \Pr(h \vert v) &= \prod_{i}^{n} \Pr(h_i \vert v) \\ \Pr(v \vert h) &= \prod_{j}^{m} \Pr(v_j \vert h) \end{aligned}$$
This simplification closely resembles negative sampling.
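Putting the pieces together, a CD-1 update for a binary RBM under the energy above might look as follows. The sizes, learning rate, and input vector are illustrative; using the hidden probabilities rather than binary samples in the gradient terms is an assumed (and common) variance-reduction choice:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# A small binary RBM; shapes: A is (n x m), beta is (n,), gamma is (m,).
n, m = 4, 6
A = 0.01 * rng.standard_normal((n, m))
beta, gamma = np.zeros(n), np.zeros(m)

def cd1_step(v, lr=0.1):
    """One CD-1 parameter update from a single visible sample v of shape (m,)."""
    # Positive phase: Pr(h | v) factorizes over hidden units.
    ph = sigmoid(A @ v + beta)
    h = (rng.random(n) < ph).astype(float)
    # Negative phase: one Gibbs sweep v' ~ Pr(v | h), then Pr(h | v').
    pv = sigmoid(h @ A + gamma)
    v2 = (rng.random(m) < pv).astype(float)
    ph2 = sigmoid(A @ v2 + beta)
    # Gradient estimate: <h v^T>_data - <h v^T>_model, one sample each.
    dA = lr * (np.outer(ph, v) - np.outer(ph2, v2))
    return dA, lr * (ph - ph2), lr * (v - v2)

v = np.array([1.0, 0, 1, 0, 1, 0])
dA, dbeta, dgamma = cd1_step(v)
A += dA; beta += dbeta; gamma += dgamma
```

The factorized conditionals from the previous equations are what make each Gibbs half-step a single vectorized sigmoid.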
Application Example 2: Network Structure Estimation in Probabilistic Graphical Models
$$\Pr(G \vert S) \propto \sum_{\theta} \Pr(S \vert \theta,G) \Pr(\theta \vert G) \Pr(G)$$
where $\Pr(\theta \vert G)$ is the prior over the parameters given the network structure. (Different structures induce different parameter layouts.)
The network structure is discrete, and $\Pr(G)$ is usually taken to be uniform.
$$\Pr(G \vert S) \propto \sum_{\theta} \Pr(S \vert \theta,G) \Pr(\theta \vert G)$$
Restrict attention to: directed graphical models over discrete random variables with finitely many values.
The network structure can then be searched greedily (e.g., hill climbing, beam search, branch and bound, genetic algorithms, …).
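For three binary variables the DAG space is small enough to score exhaustively, which makes the idea concrete; greedy methods replace this enumeration when the variable count grows. A toy sketch using a BIC score instead of the full marginal likelihood (data-generating probabilities are made up):

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data from a known chain X0 -> X1 -> X2 over binary variables.
N = 2000
x0 = (rng.random(N) < 0.5).astype(int)
x1 = (rng.random(N) < np.where(x0 == 1, 0.9, 0.1)).astype(int)
x2 = (rng.random(N) < np.where(x1 == 1, 0.8, 0.2)).astype(int)
data = np.stack([x0, x1, x2], axis=1)

def bic_score(data, parents):
    """Log-likelihood at the MLE counts, minus a BIC complexity penalty."""
    N, n = data.shape
    score = 0.0
    for i in range(n):
        ps = list(parents[i])
        # Encode each row's parent configuration as a single integer key.
        keys = data[:, ps] @ (2 ** np.arange(len(ps))) if ps else np.zeros(N, dtype=int)
        for k in np.unique(keys):
            rows = data[keys == k, i]
            for c in (0, 1):
                cnt = int((rows == c).sum())
                if cnt:
                    score += cnt * np.log(cnt / len(rows))
        score -= 0.5 * np.log(N) * (2 ** len(ps))  # one free parameter per config
    return score

def is_dag(parents):
    remaining = set(range(len(parents)))
    while remaining:  # repeatedly peel off a node with no remaining parents
        root = next((i for i in remaining if not (set(parents[i]) & remaining)), None)
        if root is None:
            return False
        remaining.discard(root)
    return True

# Candidate parent sets per node: every subset of the other two nodes.
cands = []
for i in range(3):
    others = [j for j in range(3) if j != i]
    cands.append([s for r in range(3) for s in itertools.combinations(others, r)])

best = max((ps for ps in itertools.product(*cands) if is_dag(ps)),
           key=lambda ps: bic_score(data, ps))
print(best)  # some member of the chain's Markov equivalence class (2 edges)
```

Note that score-based search can only identify the structure up to Markov equivalence: the chains X0→X1→X2, X2→X1→X0, and the fork X0←X1→X2 all receive the same BIC score.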