
Machine Learning - Theory - Generalized Linear Models
- Definition of the generalized linear model:
- The exponential family:
- Exponential-family form of the Gaussian distribution:
- Exponential-family form of the Bernoulli distribution:
- The three modeling assumptions of the GLM:
- Deriving other formulas from the GLM
- Deriving the linear regression hypothesis
- Deriving logistic regression
- Deriving the softmax multiclass algorithm
- Postscript:
Definition of the generalized linear model:
The exponential family:
$ p(y;η) = b(y)\,e^{η^{T}T(y) - a(η)} $
- η: the natural parameter
- T(y): the sufficient statistic, which in most cases is simply y
- a(η): the log partition function, a normalizing term that guarantees $ \sum_{y}p(y;η) = 1 $ (a tiny evaluation sketch follows)
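To make the definition concrete, here is a minimal Python sketch (the function name `exp_family_pdf` is mine, not from the post) that evaluates a density of this form given the three components:

```python
import numpy as np

def exp_family_pdf(y, eta, b, T, a):
    """Evaluate p(y; eta) = b(y) * exp(eta * T(y) - a(eta))
    for a scalar natural parameter eta."""
    return b(y) * np.exp(eta * T(y) - a(eta))
```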
Exponential-family form of the Gaussian distribution:
- The Gaussian density:
$ f(x)= \frac{1}{\sqrt{2π}\,δ}\,e^{-\frac{(x-μ)^2}{2δ^2}} $
- In linear regression, δ has no effect on the choice of the model parameters θ, so for convenience we set δ = 1:
$ p(y;μ) = \frac{1}{\sqrt{2π}}e^{-\frac{1}{2}(y-μ)^2} $
- Separating out the $ y^2 $ term:
$ p(y;μ) = \frac{1}{\sqrt{2π}}e^{-\frac{1}{2}y^2} \cdot e^{μy-\frac{1}{2}μ^2} $
- Matching the exponential-family coefficients (a numerical check follows below):
$ η = μ $
$ T(y) = y $
$ a(η) = \frac{μ^2}{2} = \frac{η^2}{2} $
$ b(y) = \frac{1}{\sqrt{2π}}e^{-\frac{y^2}{2}} $
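As a quick numerical sanity check (a sketch of my own, not part of the original derivation), these coefficients should reproduce the N(μ, 1) density exactly:

```python
import numpy as np

mu = 0.7                                    # the mean; eta = mu for the Gaussian
y = np.linspace(-3.0, 3.0, 7)

eta = mu
b = np.exp(-y**2 / 2) / np.sqrt(2 * np.pi)  # b(y)
a = eta**2 / 2                              # a(eta) = eta^2 / 2
exp_family = b * np.exp(eta * y - a)        # b(y) * exp(eta*T(y) - a(eta)), T(y) = y

gaussian = np.exp(-(y - mu)**2 / 2) / np.sqrt(2 * np.pi)
assert np.allclose(exp_family, gaussian)    # the two forms agree
```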
Exponential-family form of the Bernoulli distribution:
- The Bernoulli pmf, where φ is the probability of the positive outcome:
$ p(y;φ) = φ^{y}(1-φ)^{1-y} $
- Logistic regression assumes this Bernoulli distribution:
$ p(y=1;φ) = φ $
$ p(y=0;φ) = 1-φ $
- Rewriting both factors with base e and taking logs in the exponent:
$ p(y;φ) = e^{y\logφ} \cdot e^{(1-y)\log(1-φ)} $
$ p(y;φ) = e^{y\logφ + (1-y)\log(1-φ)} $
$ p(y;φ) = e^{y\logφ - y\log(1-φ) + \log(1-φ)} $
- Collecting the terms with coefficient y:
$ p(y;φ) = e^{y\log\frac{φ}{1-φ} + \log(1-φ)} $
- Matching against the exponential-family form (checked numerically below):
$ η = \log\frac{φ}{1-φ} $
$ φ = \frac{1}{1+e^{-η}} $
$ b(y) = 1 $
$ T(y) = y $
$ a(η) = -\log(1-φ) = \log(1+e^{η}) $
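Again a small sanity check of my own: with η = log(φ/(1−φ)), the form exp(yη − log(1+e^η)) recovers the Bernoulli pmf, and the sigmoid inverts the link:

```python
import numpy as np

phi = 0.3
eta = np.log(phi / (1 - phi))                    # the logit link
assert np.isclose(1 / (1 + np.exp(-eta)), phi)   # sigmoid inverts it

for y in (0, 1):
    exp_family = np.exp(y * eta - np.log(1 + np.exp(eta)))  # b(y)=1, T(y)=y
    bernoulli = phi**y * (1 - phi)**(1 - y)
    assert np.isclose(exp_family, bernoulli)
```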
The three modeling assumptions of the GLM:
- Assumption 1: the conditional distribution of y belongs to the exponential family:
$ y|x;θ \sim \text{ExponentialFamily}(η) $
- Assumption 2:
- Given x, the goal of the GLM is to predict $ T(y)|x $.
- Since in most cases $ T(y) = y $, this reduces to predicting $ y|x $,
- i.e. the hypothesis is $ h(x) = E[y|x] $.
- For example, in logistic regression $ h_θ(x) = p(y=1|x;θ) = 0 \cdot p(y=0|x;θ) + 1 \cdot p(y=1|x;θ) = E[y|x;θ] $.
- Assumption 3: the natural parameter η is linear in x:
$ η = θ^{T}x $
- If η is a vector, then $ η_{i} = θ_{i}^{T}x $ (the three assumptions are sketched in code right after this list).
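Put together, the three assumptions say that GLM prediction is always "compute η = θᵀx, then map η to the mean through the inverse link". A minimal sketch, with names of my own choosing:

```python
import numpy as np

def glm_predict(theta, x, inverse_link):
    """Assumption 3: eta = theta^T x; Assumption 2: h(x) = E[y|x],
    obtained by pushing eta through the inverse link."""
    eta = theta @ x
    return inverse_link(eta)

x = np.array([1.0, 2.0, -0.5])        # feature vector (first entry = intercept term)
theta = np.array([0.1, -0.3, 0.8])

h_linear = glm_predict(theta, x, lambda eta: eta)                   # Gaussian: identity link
h_logistic = glm_predict(theta, x, lambda eta: 1/(1+np.exp(-eta)))  # Bernoulli: sigmoid
```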
Deriving other formulas from the GLM
Deriving the linear regression hypothesis
- Linear regression assumes a Gaussian conditional distribution:
$ y|x;θ \sim N(μ, δ^{2}) $
- By Assumption 2, the hypothesis is $ h(x) = E[y|x] $:
$ h(x) = E[y|x;θ] = μ $
- From the Gaussian exponential-family form derived above:
$ η = μ $
- Therefore:
$ h(x) = η $
- And by Assumption 3:
$ h(x) = θ^{T}x $
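So the GLM view recovers the familiar linear hypothesis. As an illustration (a sketch of my own on synthetic data), fitting θ by least squares maximizes the Gaussian likelihood:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])  # intercept + 2 features
theta_true = np.array([1.0, 2.0, -3.0])
y = X @ theta_true + rng.normal(size=100)     # Gaussian noise with delta = 1

# Least squares = maximum likelihood under the Gaussian GLM
theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
h = X @ theta_hat                             # h(x) = theta^T x
```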
Deriving logistic regression
- Logistic regression assumes a Bernoulli conditional distribution:
$ y|x;θ \sim \text{Bernoulli}(φ) $
- By Assumption 2, the hypothesis is $ h(x) = E[y|x] $:
$ h(x) = E[y|x;θ] = φ $
- From the Bernoulli exponential-family form derived above:
$ η = \log\frac{φ}{1-φ} $
$ φ = \frac{1}{1+e^{-η}} $
- Therefore:
$ h(x) = \frac{1}{1+e^{-η}} $
- And by Assumption 3:
$ h(x) = \frac{1}{1+e^{-θ^{T}x}} $
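The same hypothesis in code, plus one gradient-ascent step on the Bernoulli log-likelihood as an illustration (a sketch of my own with synthetic data; the update rule is the standard one, not something specific to this post):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])
y = (rng.random(100) < sigmoid(X @ np.array([0.5, 2.0, -1.0]))).astype(float)

theta = np.zeros(3)
lr = 0.1
# Gradient of sum_i log p(y_i|x_i; theta) for the Bernoulli GLM
grad = X.T @ (y - sigmoid(X @ theta))
theta += lr * grad            # one ascent step; h(x) = sigmoid(theta^T x)
```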
Deriving the softmax multiclass algorithm
- y can take one of k values, each with its own probability (the last probability is determined by the others):
$ \begin{bmatrix} y_{1} \\ y_{2} \\ \vdots \\ y_{k} \end{bmatrix}, \quad \begin{bmatrix} φ_{1} \\ φ_{2} \\ \vdots \\ 1- \sum_{i=1}^{k-1}φ_{i} \end{bmatrix} $
- The indicator $ 1\{y=i\} $ marks classification into class i; it can be expressed with the vector T(y), whose i-th component is 1 and all others 0:
$ T(i) = \begin{bmatrix} 0 \\ \vdots \\ 1 \; (\text{position } i) \\ \vdots \\ 0 \end{bmatrix} $
- The multiclass pmf in product form:
$ p(y;φ) = φ_{1}^{1\{y=1\}} \cdot φ_{2}^{1\{y=2\}} \cdots φ_{k}^{1\{y=k\}} $
$ p(y;φ) = φ_{1}^{T(y)_{1}} \cdot φ_{2}^{T(y)_{2}} \cdots φ_{k}^{T(y)_{k}}, \quad T(y)_{k} = 1-\sum_{i=1}^{k-1}T(y)_{i} $
- Rewriting with base e and taking logs in the exponent:
$ p(y;φ) = e^{T(y)_{1}\logφ_{1} + T(y)_{2}\logφ_{2} + \cdots + (1-\sum_{i=1}^{k-1}T(y)_{i})\logφ_{k}} $
- Distributing the $ \sum_{i=1}^{k-1}T(y)_{i} $ term over the preceding terms:
$ p(y;φ) = e^{T(y)_{1}\log\frac{φ_{1}}{φ_{k}} + T(y)_{2}\log\frac{φ_{2}}{φ_{k}} + \cdots + T(y)_{k-1}\log\frac{φ_{k-1}}{φ_{k}} + \logφ_{k}} $
- which gives the exponential-family coefficients:
$ η = \begin{bmatrix} \log\frac{φ_{1}}{φ_{k}} \\ \log\frac{φ_{2}}{φ_{k}} \\ \vdots \\ \log\frac{φ_{k-1}}{φ_{k}} \end{bmatrix} $
$ b(y) = 1 $
$ a(η) = -\logφ_{k} $
- Inverting the link to express φ in terms of η (with the convention $ η_{k} = \log\frac{φ_{k}}{φ_{k}} = 0 $ so the sums below can run to k):
$ η_{i} = \log\frac{φ_{i}}{φ_{k}} $
$ e^{η_{i}} = \frac{φ_{i}}{φ_{k}} $
$ e^{η_{i}}\,φ_{k} = φ_{i} $
$ φ_{k}\sum_{i=1}^{k}e^{η_{i}} = \sum_{i=1}^{k}φ_{i} = 1 $
$ φ_{k} = \frac{1}{\sum_{i=1}^{k}e^{η_{i}}} $
- Therefore:
$ φ_{i} = \frac{e^{η_{i}}}{\sum_{j=1}^{k}e^{η_{j}}} $
$ p(y=i|x;θ) = φ_{i} $
- By Assumption 3:
$ p(y=i|x;θ) = \frac{e^{θ_{i}^{T}x}}{\sum_{j=1}^{k}e^{θ_{j}^{T}x}} $
- So the hypothesis $ h_θ(x) $ is:
$ h_θ(x) = E[T(y)|x;θ] $
$ h_θ(x) = \begin{bmatrix} φ_{1} \\ φ_{2} \\ \vdots \\ φ_{k-1} \end{bmatrix} = \begin{bmatrix} \frac{e^{θ_{1}^{T}x}}{\sum_{j=1}^{k}e^{θ_{j}^{T}x}} \\ \frac{e^{θ_{2}^{T}x}}{\sum_{j=1}^{k}e^{θ_{j}^{T}x}} \\ \vdots \\ \frac{e^{θ_{k-1}^{T}x}}{\sum_{j=1}^{k}e^{θ_{j}^{T}x}} \end{bmatrix} $
- Maximum likelihood estimation over m training examples then gives:
$ \ell(θ) = \sum_{i=1}^{m}\log p(y^{(i)}|x^{(i)};θ) $
$ \ell(θ) = \sum_{i=1}^{m}\log\prod_{l=1}^{k}\left(\frac{e^{θ_{l}^{T}x^{(i)}}}{\sum_{j=1}^{k}e^{θ_{j}^{T}x^{(i)}}}\right)^{1\{y^{(i)} = l\}} $
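To close the loop, a numerically stable sketch (my own code; the stabilizing max-shift is a standard trick, not part of the derivation above) computing the softmax probabilities and the log-likelihood ℓ(θ):

```python
import numpy as np

def softmax(logits):
    """phi_i = e^{eta_i} / sum_j e^{eta_j}; shifting by the max
    leaves the ratio unchanged but avoids overflow."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
m, d, k = 100, 3, 4
X = rng.normal(size=(m, d))
y = rng.integers(0, k, size=m)        # class labels 0..k-1
Theta = rng.normal(size=(k, d))       # row i plays the role of theta_i

phi = softmax(X @ Theta.T)            # p(y=i|x; theta) for every example and class
log_lik = np.log(phi[np.arange(m), y]).sum()   # l(theta) from the last formula
```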
Postscript:
- This post is the author's hand-written LaTeX derivation; it assumes familiarity with LinearRegression, LogisticRegression, and softmax.
- For more background, see Andrew Ng's classic machine learning course: stanford - cs229.