# 写在前面

介绍最近看的基于p\ell_p范数主成分分析(Principal Component Analysis, PCA)的文章[1]

# 基于p\ell_p范数的主成分分析

  • 2\ell_2-PCA

F2(W)=12i=1NWTxi22=12tr(WTXXTW)F_2 (W) = \frac{1}{2}\sum_{i=1}^{N} \|W^T x_i\|_2^2 = \frac{1}{2} tr(W^T XX^T W)

  • 1\ell_1-PCA

F1(W)=i=1NWTxi1=i=1Nj=1mwjTxiF_1 (W) = \sum_{i=1}^{N} \|W^T x_i\|_1 = \sum_{i=1}^{N} \sum_{j=1}^{m} |w_j^T x_i|

  • p\ell_p-PCA

Fp(W)=1pi=1NWTxipp=1pi=1Nj=1mwjTxipF_p (W) = \frac{1}{p} \sum_{i=1}^{N} \|W^T x_i\|_p^p = \frac{1}{p} \sum_{i=1}^{N} \sum_{j=1}^{m} |w_j^T x_i|^p

# 求解算法(m=1m=1)

# 梯度下降

对应模型

w=argminwFp(w)=argminw1pi=1NwTxips.t.wTw=1w^* = \arg\min_w F_p(w) = \arg\min_w \frac{1}{p} \sum_{i=1}^{N} |w^T x_i|^p \quad \text{s.t.} w^Tw = 1

ai=wTxia_i = w^T x_i,则 Fp(w)=1pi=1N(sgn(ai)ai)pF_p (w) = \frac{1}{p} \sum_{i=1}^{N} (\text{sgn}(a_i)a_i)^pww的梯度为

w=dFp(w)dw=i=1NdFp(w)daidaidw=i=1N[sgn(ai)ai]p1[sgn(ai)ai+sgn(ai)]xi=i=1Nsgn(ai)sgnp1(ai)aipxi+i=1Nsgnp(ai)aip1xi=2i=1Nδ(ai)sgnp1(ai)aipxi+i=1Nsgn(ai)aip1xi\begin{aligned} \nabla_{w} &=\frac{d F_{p}(w)}{d w}=\sum_{i=1}^{N} \frac{d F_{p}(w)}{d a_{i}} \frac{d a_{i}}{d w} \\ &=\sum_{i=1}^{N}\left[\text{sgn}\left(a_{i}\right) a_{i}\right]^{p-1}\left[\text{sgn}^{\prime}\left(a_{i}\right) a_{i}+\text{sgn}\left(a_{i}\right)\right] x_{i} \\ &=\sum_{i=1}^{N} \text{sgn}^{\prime}\left(a_{i}\right) \text{sgn}^{p-1}\left(a_{i}\right) a_{i}^{p} x_{i}+\sum_{i=1}^{N} \text{sgn}^{p}\left(a_{i}\right) a_{i}^{p-1} x_{i} \\ &=2 \sum_{i=1}^{N} \delta\left(a_{i}\right) \text{sgn}^{p-1}\left(a_{i}\right) a_{i}^{p} x_{i}+\sum_{i=1}^{N} \text{sgn}\left(a_{i}\right)\left|a_{i}\right|^{p-1} x_{i} \end{aligned}

ai=wTxi0a_i = w^T x_i \neq 0,则梯度的第一项为0,即

w=i=1Nsgn(wTxi)wTxip1xi\nabla_{w} = \sum_{i=1}^{N} \text{sgn}\left(w^T x_i\right)\left|w^T x_i\right|^{p-1} x_{i}

p>1p > 1,即使存在奇异点(wTxi=0w^T x_i = 0),第一项也为0,其梯度也是良定义的(见上式)。而当p<1p < 1时需要对ww进行扰动来避免奇异情况(wTxi=0w^T x_i = 0)。

该问题可以通过最速下降法来迭代求解。

  • 初值选取
    • 2\ell_2-PCA的结果
    • 具有最大范数的样本方向
  • 学习率控制收敛速率,α=0.1N\alpha = \frac{0.1}{N}
  • step 2 避免当p1p \leq 1时的奇异情况
  • step 5 保证规范化w2=1\|w\|_2 = 1

基于梯度下降的PCA-Lp算法

# 梯度正交向量

w=ww(wTw)=(IdwwT)w=(IdwwT)i=1Nsgn(wTxi)wTxip1xi\begin{aligned} \nabla_{w}^{\perp} &=\nabla_{w}-w\left(w^{T} \nabla_{w}\right)=\left(I_{d}-w w^{T}\right) \nabla_{w} \\ &=\left(I_{d}-w w^{T}\right) \sum_{i=1}^{N} \text{sgn}\left(w^{T} x_{i}\right)\left|w^{T} x_{i}\right|^{p-1} x_{i} \end{aligned}

ci=sgn(wTxi)wTxip1,vi=(IdwwT)xi,fi=civic_i = \text{sgn}\left(w^{T} x_{i}\right)\left|w^{T} x_{i}\right|^{p-1}, v_i = \left(I_{d}-w w^{T}\right) x_{i}, f_i = c_i v_i

w=i=1Ncivi=i=1Nfi\nabla_{w}^{\perp} = \sum_{i=1}^{N} c_i v_i = \sum_{i=1}^{N} f_i

  • ai=wTxi,vi=xiw(wTxi)a_i = |w^T x_i|,v_i = x_i - w(w^T x_i)
  • ww过原点OO旋转
  • 每一个样本 xix_iww 施加一个正交的力 fif_i
    • wTxi>0w^T x_i > 0 ,点 xix_iww 产生拉力
    • wTxi<0w^T x_i < 0 ,点 xix_iww 产生推力,反之对 w-w 产生拉力
    • 力的大小为fi=aip1vi|f_i| = a_i^{p-1}|v_i|
      • p=2p=2时,fi=aivi|f_i| = a_i|v_i|,力收到两方面的乘积影响(拟合性)
      • p=1p=1时,fi=vi|f_i| = |v_i|,力只收到一方面的乘积影响(鲁棒性)
      • p<1p<1时,fi=(viai)1pvip|f_i| = (\frac{|v_i|}{a_i})^{1-p}|v_i|^paia_i对力产生负影响,从而降低异常值的干扰
      • p0p \to 0时,fi=viai|f_i| = \frac{|v_i|}{a_i}

# 传统PCA联系

p=2p = 2

w=i=1Nsgn(wTxi)wTxip1xi=i=1NxixiTw\nabla_{w} = \sum_{i=1}^{N} \text{sgn}\left(w^T x_i\right)\left|w^T x_i\right|^{p-1} x_{i} = \sum_{i=1}^{N} x_{i} x_i^T w

w=(IdwwT)i=1Nsgn(wTxi)wTxip1xi=(IdwwT)XXTw\nabla_{w}^{\perp} =\left(I_{d}-w w^{T}\right) \sum_{i=1}^{N} \text{sgn}\left(w^{T} x_{i}\right)\left|w^{T} x_{i}\right|^{p-1} x_{i} = \left(I_{d}-w w^{T}\right) XX^T w

因此优化问题可以通过协方差矩阵的特征值分解解决。而梯度正交向量有如下性质:

w=0w=XXTw\nabla_{w}^{\perp} = 0 \iff \nabla_{w} = XX^T w

# 拉格朗日乘子法

约束优化问题转化为如下

L(w,λ)=Fp(w)+λ(wTw1)L(w, \lambda ) = F_p(w) + \lambda (w^T w - 1)

令拉格朗日函数导数为0可得最优解的必要性

dL(w,λ)dw=w+λw=0\frac{dL(w,\lambda)}{dw} = \nabla_w + \lambda w = 0

可得ww与梯度w\nabla_w平行,又w2=1\|w\|_2=1,因此可对ww直接更新

www2w \leftarrow \frac{\nabla_w}{\|\nabla_w\|_2}

通常情况下,这种令导数为零方法取到的不仅是最大值,也有可能是最小值。因此在p\ell_p-PCA里,对p1p \geq 1,由FpF_p的凸性可得该迭代下的目标函数非减。

Fp(wk+1)Fp(wk)+wT(wk+1wk)Fp(wk)F_p(w^{k+1}) \geq F_p(w^{k}) + \nabla_w^T (w^{k+1} - w^{k}) \geq F_p(w^{k})

第二个不等号是因为wk+12=wk2=1\|w^{k+1}\|_2 = \|w^{k}\|_2 = 1wk+1w^{k+1}平行于w\nabla_wwTwk+1=1\nabla_w^T w^{k+1} = 1

基于Lagrangian乘子法的PCA-Lp算法

# 求解算法(m>1m>1)

# 贪婪算法(近似求解)

  • 往往求得局部最优解,而不是全局最优解

# 非贪婪解

对目标函数求梯度

W=dFp(W)dW=[w1,,wm]\nabla_W = \frac{dF_p(W)}{dW} = [\nabla_{w_1},\dots,\nabla_{w_m}]

对应的拉格朗日函数为

L(W,Γm)=Fp(W)+tr(ΓmT(WTWIm))L(W, \Gamma_m ) = F_p(W) + tr(\Gamma_m^T (W^T W - I_m))

同样的,设导数为0,即

dL(W,Γm)dW=W+2WΓm=0\frac{dL(W, \Gamma_m )}{dW} = \nabla_W + 2W\Gamma_m = 0

迭代不能再是简单的赋值,需满足正交约束,因此考虑如下优化问题

W=argmaxQtr(QTW)s.t.QTQ=ImW^* = \arg\max_Q tr(Q^T \nabla_W) \quad \text{s.t.} Q^T Q = I_m

W\nabla_W的SVD分解为W=UΛVT\nabla_W = U\Lambda V^T,记Z=VTQTUZ = V^T Q^T U,得ZZT=Id,zii1ZZ^T = I_d,z_{ii}\leq 1。因此

tr(QTW)=tr(QTUΛVT)=tr(ΛVTQTU)=tr(ΛZ)=iλiiziiiλii\begin{aligned} tr(Q^T \nabla_W) &= tr(Q^T U\Lambda V^T) = tr(\Lambda V^T Q^T U) \\ &= tr(\Lambda Z) = \sum_i \lambda_{ii}z_{ii} \leq \sum_i \lambda_{ii} \end{aligned}

取等号时当且仅当zii=1z_{ii} = 1,此时Z=[Im0]Z = [I_m | 0],最优解

W=UZTVT=U[Im0]TVTW^* = U Z^T V^T = U [I_m | 0]^T V^T

# 参考文献


  1. Kwak N. Principal component analysis by Lp-norm maximization. IEEE Trans Cybern. 2014;44(5):594-609. ↩︎