Gaussian process and Brownian motion
Lecture 1
Brownian motion
- Brownian motion: limit of symmetric random walk
- taking smaller and smaller steps in smaller and smaller time intervals
- every \(\Delta t\) time units we take a step of size \(\Delta x\), either to the left or to the right with equal probability
- let \(\Delta x=\sigma\sqrt{\Delta t}\)
- \(X(t)=\Delta x(X_1+\ldots+X_{[t/\Delta t]})\), where the \(X_i\) are independent with \(\Pr(X_i=1)=\Pr(X_i=-1)=1/2\)
- \(E[X(t)]=0\), \(Var[X(t)]=(\Delta x)^2[t/\Delta t]\approx\sigma^2 t\)
- let \(\Delta t\to 0\)
- X(t) is normal with mean 0 and variance \(\sigma^2 t\) (CLT)
- \(\{X(t),t\geq 0\}\) have independent and stationary increments (random walk property)
- A stochastic process \(\{X(t),t\geq 0\}\) is said to be a Brownian motion (Wiener) process if
- X(0) = 0;
- \(\{X(t),t\geq 0\}\) has stationary and independent increments;
- for every t > 0, X(t) is normally distributed with mean 0 and variance \(\sigma^2t\).
- assuming \(\sigma=1\), the process is called standard Brownian motion
- density of \(X(t)\), \(f_t(x)=e^{-x^2/2t}/\sqrt{2\pi t}\)
- Joint density of \(X(t_1),X(t_2),\ldots,X(t_n)\), \(t_1<\ldots<t_n\)
- independent increments -> \(X(t_1),X(t_2)−X(t_1),\ldots,X(t_n)−X(t_{n−1})\) are independent
- stationary increments -> \(X(t_k)−X(t_{k-1}) \sim N(0,t_k-t_{k-1})\)
- so \(f(x_1,\ldots,x_n)=f_{t_1}(x_1)\prod_{i=2}^nf_{t_i-t_{i-1}}(x_i-x_{i-1})\)
- conditional distribution of X(s) given that X(t)=B where s < t
- \(f_{s|t}(x|B)=\frac{f_s(x)f_{t-s}(B-x)}{f_t(B)}\)
- a normal dist with mean \(Bs/t\) and variance \(s(t-s)/t\)
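A minimal simulation sketch of the random-walk construction above (assuming NumPy; step sizes, horizon, and seed are illustrative choices):

```python
import numpy as np

# Approximate Brownian motion by the symmetric random walk described above,
# with step size dx = sigma*sqrt(dt), and check E[X(t)] ~ 0, Var[X(t)] ~ sigma^2 * t.
rng = np.random.default_rng(0)
sigma, t, dt = 1.0, 2.0, 1e-3
n_steps, n_paths = int(t / dt), 10_000

steps = rng.choice([-1.0, 1.0], size=(n_paths, n_steps))   # +/-1 each with prob 1/2
X_t = sigma * np.sqrt(dt) * steps.sum(axis=1)              # X(t) for each simulated path

print(X_t.mean(), X_t.var())   # ~0 and ~ sigma^2 * t = 2.0
```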
Hitting Times, Maximum Variable, and the Gambler's Ruin Problem
- \(T_a\): the first time the Brownian motion process hits \(a\)
- When \(a>0\), consider \(\Pr[X(t)\geq a]\)
- \(\Pr[X(t)\geq a]=\Pr[X(t)\geq a|T_a\leq t]\Pr(T_a\leq t)+\Pr[X(t)\geq a|T_a>t]\Pr(T_a>t)\)
- \(\Pr[X(t)\geq a|T_a\leq t]=1/2\) (symmetry)
- \(\Pr[X(t)\geq a|T_a>t]=0\) (continuity)
- \(=\Pr(T_a\leq t)/2\)
- So \(\Pr(T_a\leq t)=2\Pr[X(t)\geq a]=2[1-\Phi(a/\sqrt{t})]\)
- When \(a<0\), by symmetry, \(T_a\) has the same dist as \(T_{-a}\). Hence
- \(\Pr(T_a\leq t)=2\Pr[X(t)\geq |a|]=2\Phi(-|a|/\sqrt{t}), \forall a\)
- the maximum value the process attains in [0,t]
- \(\Pr[\max_{s\in[0,t]}X(s)\geq a]=\Pr(T_a\leq t)=2\Phi(-a/\sqrt{t}), a>0\).
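A hedged Monte Carlo check of the hitting-time/maximum formula (assuming NumPy and SciPy; the path discretization slightly underestimates the running maximum):

```python
import numpy as np
from scipy.stats import norm

# Check Pr(T_a <= t) = Pr(max_{[0,t]} X >= a) = 2*Phi(-a/sqrt(t)) for standard BM.
rng = np.random.default_rng(1)
a, t, n_steps, n_paths = 1.0, 1.0, 2000, 20_000

dt = t / n_steps
increments = rng.normal(0.0, np.sqrt(dt), size=(n_paths, n_steps))
paths = np.cumsum(increments, axis=1)            # X(t_i) on a grid
hit = (paths.max(axis=1) >= a).mean()            # empirical Pr(max_{[0,t]} X >= a)

print(hit, 2 * norm.cdf(-a / np.sqrt(t)))        # close; discretization biases hit slightly low
```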
Lecture 2
White Noise Transformation
- Let \(f\) be a function having a continuous derivative on \([a,b]\).
- Let \(\{X(t),t\geq 0\}\) denote a standard Brownian motion process.
- Define \(\int_a^b f(t)dX(t)=\lim_{n\to\infty,\max(t_i-t_{i-1})\to 0} \sum_{i=1}^n f(t_{i-1})[X(t_i)-X(t_{i-1})]\)
- \(=f(b)X(b)-f(a)X(a)-\int_a^bX(t)df(t)\) (integration by parts).
- \(E[\int_a^b f(t)dX(t)]=0\)
- \(Var[\int_a^b f(t)dX(t)]=\int_a^bf^2(t)dt\)
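A small sketch verifying the mean and variance of the white-noise integral by simulation (assuming NumPy; here \(f(t)=t\) on \([0,1]\), so the target variance is \(\int_0^1 t^2dt=1/3\)):

```python
import numpy as np

# Approximate I = int_a^b f(t) dX(t) by sum_i f(t_{i-1})[X(t_i)-X(t_{i-1})]
# over many Brownian paths, and check E[I] ~ 0 and Var[I] ~ int_a^b f(t)^2 dt.
rng = np.random.default_rng(2)
a, b, n_steps, n_paths = 0.0, 1.0, 1000, 20_000

grid = np.linspace(a, b, n_steps + 1)
dt = (b - a) / n_steps
dX = rng.normal(0.0, np.sqrt(dt), size=(n_paths, n_steps))   # Brownian increments
I = dX @ grid[:-1]                                           # sum of f(t_{i-1}) dX_i with f(t)=t

print(I.mean(), I.var())   # ~0 and ~1/3
```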
Gaussian Processes
- A stochastic process \(\{X(t),t\geq 0\}\) is called a Gaussian, or a normal, process if \((X(t_1),\ldots,X(t_n))\) has a multivariate normal distribution for all \(t_1,\ldots,t_n\).
- A Brownian motion process \(\{X(t),t\geq 0\}\) is a Gaussian process.
- For the Brownian motion process, each of \(X(t_1),\ldots,X(t_n)\) can be expressed as a linear combination of the independent normal random variables \(X(t_1),X(t_2)-X(t_1),\ldots,X(t_n)-X(t_{n-1})\);
- hence the joint distribution is multivariate normal, so the process is Gaussian.
- \(E[X(t)]=0, Var[X(t)]=t\)
- \(\forall s<t, Cov[X(s),X(t)]=Cov[X(s),X(s)]+Cov[X(s),X(t)-X(s)]=s\) (independent increments)
- Brownian bridge (conditional Brownian motion)
- Given a standard Brownian process \(\{X(t),t\geq 0\}\)
- \(\{X(t), 0\leq t\leq 1|X(1)=0\}\) is a Brownian bridge, also a Gaussian process
- the conditional distribution of a multivariate normal is still multivariate normal
- \(E[X(s)|X(1)=0]=0, \forall s<1\)
- \(Cov[X(s),X(t)|X(1)=0]\) for \(s<t<1\)
- \(=E[X(s)X(t)|X(1)=0]=E[E[X(s)X(t)|X(t),X(1)=0]|X(1)=0]=E[X(t)E[X(s)|X(t)]|X(1)=0]\)
- \(=E[X(t)(s/t)X(t)|X(1)=0]=(s/t)E[X^2(t)|X(1)=0]\)
- \(=t(1-t)s/t=s(1-t)\)
- Proposition 10.1 If \(\{X(t),t\geq 0\}\) is standard Brownian motion, then \(Z(t)=X(t)-tX(1), 0\leq t\leq 1\) is a Brownian bridge process.
- Z(t) is a Gaussian process
- easy to check \(E[Z(t)]=0\) and \(Cov[Z(s),Z(t)]=s(1−t)\), when \(s\leq t\)
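A hedged simulation check of Proposition 10.1 (assuming NumPy; grid size and sample size are illustrative):

```python
import numpy as np

# Build bridge paths Z(t) = X(t) - t*X(1) from simulated standard Brownian motion
# and check Cov[Z(s), Z(t)] ~ s*(1 - t) for s <= t.
rng = np.random.default_rng(3)
n_steps, n_paths = 1000, 50_000
grid = np.linspace(0.0, 1.0, n_steps + 1)

dX = rng.normal(0.0, np.sqrt(1.0 / n_steps), size=(n_paths, n_steps))
X = np.concatenate([np.zeros((n_paths, 1)), np.cumsum(dX, axis=1)], axis=1)  # X(0)=0
Z = X - grid * X[:, [-1]]                        # Brownian bridge via Prop. 10.1

s_idx, t_idx = n_steps // 4, 3 * n_steps // 4    # s = 0.25, t = 0.75
cov = np.cov(Z[:, s_idx], Z[:, t_idx])[0, 1]
print(cov, 0.25 * (1 - 0.75))                    # ~0.1875
```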
- integrated Brownian motion
- Given a Brownian motion \(\{X(t),t\geq 0\}\)
- \(Z(t)=\int_0^t X(s)ds\) is called an integrated Brownian motion
- \(E[Z(t)]=0\) (since E[X(s)]=0)
- \(Cov[Z(s),Z(t)]=s^2(t/2-s/6)\) for \(s\leq t\).
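A quick numerical check of the integrated-Brownian-motion covariance (assuming NumPy; the time integral is approximated by a Riemann sum):

```python
import numpy as np

# Approximate Z(t) = int_0^t X(s) ds over simulated Brownian paths and
# check Cov[Z(s), Z(t)] ~ s^2 (t/2 - s/6) for s <= t.
rng = np.random.default_rng(4)
n_steps, n_paths = 1000, 50_000
dt = 1.0 / n_steps

dX = rng.normal(0.0, np.sqrt(dt), size=(n_paths, n_steps))
X = np.cumsum(dX, axis=1)                 # X on the grid (X(0)=0 omitted)
Z = np.cumsum(X, axis=1) * dt             # cumulative Riemann sums of X

s_idx, t_idx = n_steps // 2 - 1, n_steps - 1      # s = 0.5, t = 1.0
cov = np.cov(Z[:, s_idx], Z[:, t_idx])[0, 1]
print(cov, 0.5**2 * (1.0 / 2 - 0.5 / 6))          # ~0.1042
```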
Lecture 3
Stationary and Weakly Stationary Processes
- A stochastic process \(\{X(t),t\geq 0\}\) is said to be a stationary process if \(\forall n,s,t_1,\ldots,t_n\), \((X(t_1),\ldots,X(t_n))\) and \((X(t_1+s),\ldots,X(t_n+s))\) have the same joint distribution.
- a process is stationary if, in choosing any fixed point s as the origin, the ensuing process has the same probability law.
- Example: an ergodic continuous-time Markov chain \(\{X(t),t\geq 0\}\) with \(\Pr[X(0)=j]=P_j,j\geq 0\), where the \(P_j\) are the limiting probabilities, is a stationary process.
- a Markov chain whose initial state is chosen according to the limiting probabilities.
- can be regarded as an ergodic Markov chain that we start observing at time \(\infty\). Hence, the continuation of this process at time s after observation begins is just the continuation of the chain starting at time \(\infty+s\), which clearly has the same probability law for all s.
- Example: \(X(t)=N(t+L)-N(t), t\geq 0\), where \(L>0\) is a fixed constant and \(\{N(t),t\geq 0\}\) is a Poisson process having rate \(\lambda\).
- X(t) represents the number of events of a Poisson process that occur between t and t + L
- the Poisson process has stationary and independent increments, so \(\{X(t)\}\) is a stationary process
- \(\{X(t),t\geq 0\}\) is second-order stationary or weakly stationary
- if \(E[X(t)]=c\) and \(Cov[X(t),X(t+s)]\) does not depend on t
- the first two moments of X(t) are the same for all t and the covariance between X(s) and X(t) depends only on |t−s|
- As the finite dimensional distributions of a Gaussian process (being multivariate normal) are determined by their means and covariance, it follows that a second-order stationary Gaussian process is stationary.
- Proposition 10.2 Let \(\{X_n,n\geq 1\}\) be a second-order stationary process with mean \(\mu\) and covariance function \(R(i)=Cov(X_n,X_{n+i})\), and let \(\bar{X}_n=\sum_{i=1}^nX_i/n\)
- \(\lim_{n\to\infty} E[(\bar{X}_n-\mu)^2]=0\) iff \(\lim_{n\to\infty}\sum_{i=1}^n R(i)/n=0\).
- autoregressive process (Ex 10.7, Page 657)
- \(\{Z_n\}\) be uncorrelated rvs with \(E[Z_n]=0, n\geq 0\)
- \(Var(Z_0)=\sigma^2/(1-\lambda^2), Var(Z_n)=\sigma^2, n\geq 1, \lambda^2<1\)
- Define \(X_0=Z_0, X_n=\lambda X_{n-1}+Z_n, n\geq 1\)
- \(\{X_n\}\) is a first-order autoregressive process
- \(X_n=\sum_{i=0}^n\lambda^{n-i}Z_i\)
- \(E(X_n)=0, Cov(X_n,X_{n+m})=\sigma^2\lambda^m/(1-\lambda^2)\)
- \(\{X_n,n\geq 0\}\) is weakly stationary
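A hedged simulation check of the AR(1) covariance (assuming NumPy; parameter values are illustrative):

```python
import numpy as np

# Simulate many replicates of X_0 = Z_0, X_n = lam*X_{n-1} + Z_n and check
# Cov(X_n, X_{n+m}) ~ sigma^2 * lam^m / (1 - lam^2).
rng = np.random.default_rng(5)
lam, sigma, n, m, n_reps = 0.7, 1.0, 50, 3, 100_000

X = rng.normal(0.0, sigma / np.sqrt(1 - lam**2), size=n_reps)  # X_0 = Z_0, Var = sigma^2/(1-lam^2)
for k in range(1, n + m + 1):
    X = lam * X + rng.normal(0.0, sigma, size=n_reps)          # X_k = lam*X_{k-1} + Z_k
    if k == n:
        X_n = X.copy()                                         # keep X_n

print(np.cov(X_n, X)[0, 1], sigma**2 * lam**m / (1 - lam**2))  # lag-m covariance vs theory
```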
- moving average process (Ex 10.9, Page 658)
- \(\{W_n\}\) be uncorrelated with \(E[W_n]=\mu,Var[W_n]=\sigma^2,n\geq 0\)
- Define \(X_n=\sum_{i=0}^kW_{n-i}/(k+1), k>0\)
- \(\{X_n,n\geq k\}\) is a moving average process
- arithmetic average of the most recent k+1 values of the Ws
- \(Cov(X_n,X_{n+m})=(k+1-m)\sigma^2/(k+1)^2\) for \(0\leq m\leq k\), and \(=0\) for \(m>k\).
- \(\{X_n,n\geq k\}\) is a second-order stationary process.
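A short sketch checking the moving-average covariance (assuming NumPy; \(k\), \(\mu\), \(\sigma\) are illustrative):

```python
import numpy as np

# Check Cov(X_n, X_{n+m}) ~ (k+1-m)*sigma^2/(k+1)^2 for 0 <= m <= k, and 0 for m > k.
rng = np.random.default_rng(6)
k, sigma, mu, n_reps = 4, 1.0, 2.0, 200_000

W = rng.normal(mu, sigma, size=(n_reps, 2 * k + 2))          # uncorrelated W's
X_n = W[:, :k + 1].mean(axis=1)                              # average of W_0,...,W_k
for m in (2, k + 1):                                         # one lag inside k, one beyond
    X_nm = W[:, m:m + k + 1].mean(axis=1)                    # average of W_m,...,W_{m+k}
    theory = (k + 1 - m) * sigma**2 / (k + 1) ** 2 if m <= k else 0.0
    print(m, np.cov(X_n, X_nm)[0, 1], theory)
```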
Lecture 4
Application of Gaussian process models to regression problems
- First let us consider a linear regression problem (in a mixed-model or Bayesian setting)
- \(f(X)=X\beta,Y=f(X)+\epsilon, \epsilon\sim N(0,\sigma^2I_n)\)
- data: \(X\) (\(n\times p\)), \(Y\) (\(n\) observations); parameters: \(\beta\) (\(p\) regression parameters)
- assume a zero-mean Gaussian prior on \(\beta\) with covariance matrix \(\Sigma_p\): \(\beta\sim N(0,\Sigma_p)\)
- inference could be based on the posterior dist of \(\beta\)
- \(\Pr(\beta|Y,X)=\frac{\Pr(\beta)\Pr(Y|\beta,X)}{\Pr(Y|X)}\)
- normalizing constant/marginal likelihood: \(\Pr(Y|X)=\int\Pr(\beta)\Pr(Y|\beta,X)d\beta\)
- \((Y|X)\sim N(0,X\Sigma_pX^T+\sigma^2I_n)\) (using \(E[Y],Var[Y]\))
- easy to check \((\beta|Y,X)\sim N(\sigma^{-2}A^{-1}X^TY,A^{-1}), A=\sigma^{-2}X^TX+\Sigma_p^{-1}\)
- note \(A^{-1}=(\sigma^{-2}X^TX+\Sigma_p^{-1})^{-1}=\Sigma_p-\Sigma_pX^T(X\Sigma_pX^T+\sigma^2I_n)^{-1}X\Sigma_p\)
- for a new observation with covariates \(X_1\), we want to predict the response
- \(\Pr[f(X_1)|X,Y]=\int \Pr(f(X_1)|X_1,\beta)\Pr(\beta|Y,X)d\beta\)
- \(f(X_1)=X_1^T\beta\)
- average over all possible parameter values, weighted by their posterior probability
- \(\sim N(\sigma^{-2}X_1^TA^{-1}X^TY,X_1^TA^{-1}X_1)\)
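A hedged sketch of the posterior and predictive formulas above for the Bayesian linear model (assuming NumPy; the data, true coefficients, and prior are simulated/illustrative):

```python
import numpy as np

# Bayesian linear regression: posterior of beta and predictive dist of f(X_1).
rng = np.random.default_rng(7)
n, p, sigma2 = 100, 3, 0.5**2
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -0.5, 0.25])                   # illustrative true coefficients
Y = X @ beta_true + rng.normal(0.0, np.sqrt(sigma2), size=n)

Sigma_p = np.eye(p)                                       # prior: beta ~ N(0, Sigma_p)
A = X.T @ X / sigma2 + np.linalg.inv(Sigma_p)             # A = sigma^{-2} X'X + Sigma_p^{-1}
A_inv = np.linalg.inv(A)
post_mean = A_inv @ X.T @ Y / sigma2                      # E[beta | Y, X]

x1 = rng.normal(size=p)                                   # new covariate vector X_1
pred_mean = x1 @ post_mean                                # E[f(X_1) | X, Y]
pred_var = x1 @ A_inv @ x1                                # Var[f(X_1) | X, Y]
print(post_mean, pred_mean, pred_var)
```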
- extend linear model by projecting covariates \(X\) into high dimensional space
- project \(X\) into some high dimensional space using a set of basis functions and then apply the linear model in this space instead of directly on \(X\) themselves.
- assume \(\Phi(X)\): map \(p\)-dim \(X\) into M-dim covariates
- Consider model \(f(X)=\Phi(X)\beta\), \(Y=f(X)+\epsilon\), \(\epsilon\sim N(0,\sigma^2I_n)\)
- data: \(\Phi(X)\), an \(n\times M\) matrix; parameter: \(\beta\in R^M\)
- prior \(\beta\sim N(0,\Sigma_M)\)
- posterior \((\beta|Y,X)\sim N(\sigma^{-2}A^{-1}\Phi(X)^TY,A^{-1}), A=\sigma^{-2}\Phi(X)^T\Phi(X)+\Sigma_M^{-1}\)
- given \(X_1\in R^p\), \(\Phi(X_1)\in R^M\). Consider \(f(X_1)=\Phi(X_1)^T\beta\). \([f(X_1)|X,Y]\) still follows a normal dist with
- mean \(\Phi(X_1)^T\Sigma_M\Phi(X)^T(K+\sigma^2I_n)^{-1}Y\), \(K=\Phi(X)\Sigma_M\Phi(X)^T\)
- variance \(\Phi(X_1)^T\Sigma_M\Phi(X_1)-\Phi(X_1)^T\Sigma_M\Phi(X)^T(K+\sigma^2I_n)^{-1}\Phi(X)\Sigma_M\Phi(X_1)\)
- \(\Leftarrow \sigma^{-2}A^{-1}\Phi(X)^T=\Sigma_M\Phi(X)^T(K+\sigma^2I_n)^{-1}\) (multiply both sides by \(A\) to prove mean equivalence)
- covariance matrix equivalence derived from the matrix inversion lemma (Woodbury, Sherman & Morrison formula)
- \((Z+UWV^T)^{-1}=Z^{-1}-Z^{-1}U(W^{-1}+V^TZ^{-1}U)^{-1}V^TZ^{-1}\)
- Z (nxn); W (mxm); U,V (nxm) (works well when \(m<n\))
- similar results for determinant: \(|Z+UWV^T|=|Z||W||W^{-1}+V^TZ^{-1}U|\)
- when \(U=V=I\), \((A^{-1}+B^{-1})^{-1}=A-A(A+B)^{-1}A=B-B(A+B)^{-1}B\)
- taking \(Z=\Sigma_M^{-1},W=\sigma^{-2}I_n,V=U=\Phi(X)^T\), \(\Rightarrow\) variance equivalence
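A quick numerical verification of the matrix inversion lemma and the matching determinant identity (assuming NumPy; the matrices are random symmetric positive-definite examples):

```python
import numpy as np

# Check (Z + U W V^T)^{-1} = Z^{-1} - Z^{-1} U (W^{-1} + V^T Z^{-1} U)^{-1} V^T Z^{-1}
# and |Z + U W V^T| = |Z||W||W^{-1} + V^T Z^{-1} U| on random matrices.
rng = np.random.default_rng(8)
n, m = 6, 3

def random_spd(k):
    B = rng.normal(size=(k, k))
    return B @ B.T + k * np.eye(k)          # symmetric positive definite

Z, W = random_spd(n), random_spd(m)
U, V = rng.normal(size=(n, m)), rng.normal(size=(n, m))

lhs = np.linalg.inv(Z + U @ W @ V.T)
rhs = np.linalg.inv(Z) - np.linalg.inv(Z) @ U @ np.linalg.inv(
    np.linalg.inv(W) + V.T @ np.linalg.inv(Z) @ U) @ V.T @ np.linalg.inv(Z)
print(np.max(np.abs(lhs - rhs)))            # ~1e-12

det_lhs = np.linalg.det(Z + U @ W @ V.T)
det_rhs = np.linalg.det(Z) * np.linalg.det(W) * np.linalg.det(
    np.linalg.inv(W) + V.T @ np.linalg.inv(Z) @ U)
print(det_lhs, det_rhs)                     # agree
```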
- all posterior quantities are determined by \(\Phi(X_1)^T\Sigma_M\Phi(X)^T\), \(\Phi(X)\Sigma_M\Phi(X)^T\) or \(\Phi(X_1)^T\Sigma_M\Phi(X_1)\)
- they are all quadratic forms and can be treated as a type of inner product (wrt \(\Sigma_M\))
- define \(k(X,X_1)=\Phi(X)^T\Sigma_M\Phi(X_1)\) (covariance function or kernel)
- Let \(\Sigma_M^{1/2}\) be the square root of \(\Sigma_M\), then \(k(X,X_1)=[\Sigma_M^{1/2}\Phi(X)]^T[\Sigma_M^{1/2}\Phi(X_1)]\)
- when an algorithm is defined solely in terms of inner products of the inputs, it can be extended by replacing those inner products by \(k(X,X_1)\);
- this is sometimes called the kernel trick.
- it is particularly valuable in situations where it is more convenient to compute the kernel than the extended feature vectors themselves.
- We can treat a Gaussian process as a collection of random variables, any finite number of which have a joint Gaussian distribution
- Consider a Gaussian Process (GP) \(f(x)\) with mean function \(m(x)\) and covariance function \(k(x,x_1)\). Here \(x\in R^p\) (x can be treated as time index).
- \(E[f(x)]=m(x), Cov[f(x),f(x_1)]=E[(f(x)-m(x))(f(x_1)-m(x_1))]=k(x,x_1)\)
- \(f(x)\sim GP(m(x),k(x,x_1))\)
- the random variables represent the values of the function \(f(x)\) at locations \(x\)
- Consider a linear reg model with normal prior: \(f(x)=\Phi(x)^T\beta, \beta\sim N(0,\Sigma_M)\)
- \(E[f(x)]=\Phi(x)^TE(\beta)=0, Cov[f(x),f(x_1)]=\Phi(x)^TE[\beta\beta^T]\Phi(x_1)=\Phi(x)^T\Sigma_M\Phi(x_1)\)
- \([f(x),f(x_1)]\) follow bivariate normal dist \(\forall x, x_1\).
- \(\forall x_1,\ldots,x_n\), \([f(x_1),\ldots,f(x_n)]\) jointly follow multivariate normal dist (MVN).
- could be a degenerate/singular dist when \(n>M\) (M is the number of basis functions, i.e., dim of \(\Phi(x)\))
- \(f(x)\) can be treated as a Gaussian process with mean \(m(x)=0\) and covariance \(k(x,x_1)=\Phi(x)^T\Sigma_M\Phi(x_1)\)
- Could be extended by using a more general covariance function k(,)
- for every positive definite covariance function k(,), there exists a (possibly infinite) expansion of basis functions
- e.g., \(k(x,x_1)=\exp(-\|x-x_1\|^2/2)\) (squared exponential kernel)
- for any set of basis functions, we can compute the corresponding covariance function as \(k(x,x_1)=\Phi(x)^T\Sigma\Phi(x_1)\)
- k(,) implies a distribution over functions f(x).
- at any finite number of points, the values of f(x) follow a MVN dist with covariance matrix built from k(,)
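A hedged sketch of drawing sample functions from a zero-mean GP prior with the squared exponential kernel above (assuming NumPy; grid, jitter, and number of samples are illustrative):

```python
import numpy as np

# Sample paths from a zero-mean GP with k(x, x1) = exp(-||x - x1||^2 / 2):
# the function values on a grid are a single draw from the implied MVN.
rng = np.random.default_rng(9)
x = np.linspace(-3, 3, 80)                                # 1-d input grid
K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2)         # kernel matrix k(x_i, x_j)
L = np.linalg.cholesky(K + 1e-10 * np.eye(len(x)))        # small jitter for numerical stability

samples = L @ rng.normal(size=(len(x), 5))                # 5 functions from the prior
print(samples.shape)                                      # (80, 5): f-values at the grid points
```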
- Consider GP \(f(x)\) with mean 0 and covariance function \(k(,)\)
- observed data \(X\) (nxp) and some new data \(X_1\) (mxp), jointly \([f(X),f(X_1)]\) has MVN dist
- mean 0 and covariance \(Var[f(X)]=k(X,X),Var[f(X_1)]=k(X_1,X_1), Cov[f(X),f(X_1)]=k(X,X_1)\)
- \(f(X)\in R^n, f(X_1)\in R^m\), \(k(X,X)\) (nxn), \(k(X,X_1)\) (nxm)
- conditionally \([f(X_1)|f(X),X,X_1]\) has MVN dist
- mean \(k(X_1,X)k(X,X)^{-1}f(X)\), covariance \(k(X_1,X_1)-k(X_1,X)k(X,X)^{-1}k(X,X_1)\)
- Assume \(y=f(x)+\epsilon, \epsilon\sim N(0,\sigma^2)\)
- assume independent \(\epsilon\) for different \(x\)
- \(y\) is also a GP with mean 0 and covariance function \(K(x,x)=k(x,x)+\sigma^2\), \(K(x,x_1)=k(x,x_1), \forall x\neq x_1\)
- Given observed \(Y\) (and \(X\)), for new \(X_{\ast}\)
- \([Y,f(X_{\ast})]\) has MVN dist with mean 0 and covariance: \(Var[Y]=k(X,X)+\sigma^2I_n, Cov[Y,f(X_{\ast})]=k(X,X_{\ast}), Var[f(X_{\ast})]=k(X_{\ast},X_{\ast})\)
- \(Y\sim N(0,k(X,X)+\sigma^2I_n)\)
- conditionally \([f(X_{\ast})|Y]\) is a MVN with
- mean \(k(X_{\ast},X)[k(X,X)+\sigma^2I_n]^{-1}Y\) (linear function of Y)
- variance \(k(X_{\ast},X_{\ast})-k(X_{\ast},X)[k(X,X)+\sigma^2I_n]^{-1}k(X,X_{\ast})\)
- for single obs \(X_{\ast}\), conditional mean is \(\sum_{i=1}^n\alpha_ik(X_{\ast},X_i), \alpha=[k(X,X)+\sigma^2I_n]^{-1}Y\)
- a linear combination of n kernel functions
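A hedged sketch of the GP regression predictive formulas above (assuming NumPy; the 1-d data, the squared exponential kernel, and the noise level are illustrative):

```python
import numpy as np

# GP regression: predictive mean k(X_*,X)[k(X,X)+sigma^2 I]^{-1} Y and
# predictive covariance k(X_*,X_*) - k(X_*,X)[k(X,X)+sigma^2 I]^{-1} k(X,X_*).
rng = np.random.default_rng(10)

def kern(A, B):
    return np.exp(-0.5 * (A[:, None] - B[None, :]) ** 2)   # k(x, x1) = exp(-|x - x1|^2 / 2)

n, sigma2 = 30, 0.1**2
X = rng.uniform(-3, 3, size=n)
Y = np.sin(X) + rng.normal(0.0, np.sqrt(sigma2), size=n)   # y = f(x) + eps

X_star = np.linspace(-3, 3, 100)
K = kern(X, X) + sigma2 * np.eye(n)                        # k(X,X) + sigma^2 I_n
K_star = kern(X_star, X)                                   # k(X_*, X)
alpha = np.linalg.solve(K, Y)                              # [k(X,X) + sigma^2 I]^{-1} Y

pred_mean = K_star @ alpha                                 # sum_i alpha_i k(X_*, X_i)
pred_cov = kern(X_star, X_star) - K_star @ np.linalg.solve(K, K_star.T)
print(pred_mean[:3], np.diag(pred_cov)[:3])
```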
- GP with fixed mean function
- \(f(x)\sim GP(m(x),k(x,x))\)
- \(m(x)\) given fixed function
- conditional mean: \(m(X_{\ast})+k(X_{\ast},X)[k(X,X)+\sigma^2I_n]^{-1}[Y-m(X)]\)
Lecture 5
Application of Gaussian process models to classification with logistic regression
- Consider a linear logistic regression model with gaussian prior
- \(n\) observations: \(\mbox{logit}[\Pr(y_i=1|x_i,\beta)]=x_i^T\beta\), \(i=1,\ldots,n\), \(\beta\sim N(0,\Sigma)\), \(x_i\) (mx1), \(Y=(y_1,\ldots,y_n)^T\)
- assume conditional independent logit model
- marginal prob \(\Pr(y_i|x_i)=\int\Pr(y_i|x_i,\beta)\phi(\beta,\Sigma)d\beta\), where \(\phi(\beta,\Sigma)=(2\pi)^{-m/2}|\Sigma|^{-1/2}\exp(-\beta^T\Sigma^{-1}\beta/2)\)
- joint marginal prob \(\Pr(Y|X)=\int[\prod_{i=1}^n\Pr(y_i|x_i,\beta)]\phi(\beta,\Sigma)d\beta\)
- conditional/posterior prob \(\Pr(\beta|Y,X)=[\prod_{i=1}^n\Pr(y_i|x_i,\beta)]\phi(\beta,\Sigma)/\Pr(Y|X)\)
- For \(K\) new observations with covariates \(X_1\) (Kxm)
- classification incorporating training obs: \(\Pr(Y_1|X_1,Y,X)=\int\Pr(Y_1|X_1,\beta)\Pr(\beta|Y,X)d\beta=\frac{\int\Pr(Y_1|X_1,\beta)\Pr(Y|X,\beta)\phi(\beta,\Sigma)d\beta}{\int\Pr(Y|X,\beta)\phi(\beta,\Sigma)d\beta}\)
- involving two multi-dim integrations with Gaussian kernels
- Extension to logistic regression model with Gaussian Process prior
- \(\mbox{logit}[\Pr(y_i=1|x_i,f_i)]=f_i\), \(i=1,\ldots,n\), \(F=(f_1,\ldots,f_n)^T\sim N(0,\Sigma_F)\), \(Y=(y_1,\ldots,y_n)^T\), \(X\) (nxm)
- conditional independence: dist of \(y_i\) is completely determined by \(f_i\); given \(f_i\), all \(y_i\) are independent
- can incorporate additional independent errors to \(y_i|f_i\) (conditional independence still holds)
- GP prior: \(Cov(f_i,f_j)=K(x_i,x_j), E(f_i)=0\)
- joint marginal prob \(\Pr(Y|X)=\int[\prod_{i=1}^n\Pr(y_i|x_i,f_i)]\phi(F,\Sigma_F)dF\) (integration over n-dim GP F)
- For \(K\) new observations with covariates \(X_1\)
- \(\mbox{logit}[\Pr(Y_1=1|X_1,F_1)]=F_1\), \(F_1\sim N(0,\Sigma_1), Cov(f_{1i},f_{1j})=K(x_{1i},x_{1j})\)
- GP prior: \((F,F_1)\sim N(0,\tilde{\Sigma}), Cov(f_i,f_{1j})=K(x_i,x_{1j})\) ((n+K)-dim normal dist)
- \(F_1|F\sim N(A\Sigma_F^{-1}F,\Sigma_1-A\Sigma_F^{-1}A^T), A=Cov(F_1,F)=K(X_1,X)\)
- \(\Pr(Y_1|X_1,Y,X)=\frac{\Pr(Y_1,Y|X,X_1)}{\Pr(Y|X)}=\frac{\int\Pr(Y_1,Y|X,X_1,F,F_1)\phi[(F,F_1),\tilde{\Sigma}]d(F,F_1)}{\int\Pr(Y|X,F)\phi(F,\Sigma_F)dF}\)
- two multi-dim integrations over GP priors
- Consider logistic regression with GP prior
- multi-variate normal prior \(F\sim N(0,\Sigma_F)\)
- covariance matrix is computed based on covariance/kernel function, \(\Sigma_F=K(X,X)\)
- conditional independent logistic regression model
- \(\theta_i=\Pr(y_i=1|f_i)=\frac{1}{1+\exp(-f_i)}, i=1,\ldots,n\)
- In summary, the logistic log likelihood can be written as \(\ell=\log[\Pr(y|f)]=-\log[1+\exp(-(2y-1)f)]=yf-\log[1+\exp(f)]\), \(y\in\{0,1\}\)
- \(\frac{\partial\ell}{\partial f}=y-\theta, \frac{\partial^2\ell}{\partial f^2}=-\theta(1-\theta)\), where \(\theta=\frac{1}{1+\exp(-f)}\)
- quadratic expansion at \(f_0\): \(\ell\approx \ell(f_0)+(y-\theta_0)(f-f_0)-\frac{1}{2}\theta_0(1-\theta_0)(f-f_0)^2\)
- Laplace approx to high-dim integration
- solve \(F_0\): \(\max_F \sum_{i=1}^n\log\Pr(y_i|f_i) - \frac{1}{2}F^T\Sigma_F^{-1}F\) (i.e., maximizing the joint likelihood of \((Y,F)\), equivalently the posterior of \(F\))
- logistic regression with ridge penalty, solved using, e.g., Newton-Raphson (NR) or IWLS: \(F\Leftarrow F+(\Sigma_F^{-1}+W)^{-1}(Y-\theta-\Sigma_F^{-1}F)\), \(W=\mbox{diag}\{\theta(1-\theta)\}\) (a numerical sketch follows at the end of this lecture)
- NR can be equivalently written as \(F\Leftarrow (\Sigma_F^{-1}+W)^{-1}(Y-\theta+WF)\)
- note \((\Sigma_F^{-1}+W)^{-1}=\Sigma_F-\Sigma_FW^{1/2}(I+W^{1/2}\Sigma_FW^{1/2})^{-1}W^{1/2}\Sigma_F=\Sigma_F-\Sigma_F(W^{-1}+\Sigma_F)^{-1}\Sigma_F\)
- \(\Pr(Y|X)=\int\Pr(Y|X,F)\Pr(F)dF\approx\exp[\ell(F_0)-F_0^T\Sigma_F^{-1}F_0/2]|I+\Sigma_FW_0|^{-1/2}\)
- In principle, additional fixed-effect covariates can be incorporated; fitting then alternates between parameter estimation and approximate Laplace integration.
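A hedged numerical sketch of the NR/IWLS iteration and Laplace approximation above for GP logistic regression (assuming NumPy; the kernel, data, and iteration count are illustrative, not a definitive implementation):

```python
import numpy as np

# Laplace approximation for GP logistic regression:
# find the mode F_0 by F <- (Sigma_F^{-1} + W)^{-1}(Y - theta + W F),
# then approximate log Pr(Y|X) as in the formula above.
rng = np.random.default_rng(11)

n = 40
X = rng.uniform(-3, 3, size=n)
Sigma_F = np.exp(-0.5 * (X[:, None] - X[None, :]) ** 2) + 1e-8 * np.eye(n)  # K(X, X) with jitter
F_true = np.linalg.cholesky(Sigma_F) @ rng.normal(size=n)
Y = (rng.uniform(size=n) < 1 / (1 + np.exp(-F_true))).astype(float)         # y_i ~ Bernoulli(theta_i)

F = np.zeros(n)
for _ in range(25):                                          # Newton-Raphson iterations
    theta = 1 / (1 + np.exp(-F))
    W = np.diag(theta * (1 - theta))
    F = np.linalg.solve(np.linalg.inv(Sigma_F) + W, Y - theta + W @ F)

theta = 1 / (1 + np.exp(-F))
W0 = np.diag(theta * (1 - theta))
loglik = np.sum(Y * F - np.log1p(np.exp(F)))                 # sum_i log Pr(y_i | f_i) at the mode
log_marg = (loglik - 0.5 * F @ np.linalg.solve(Sigma_F, F)
            - 0.5 * np.linalg.slogdet(np.eye(n) + Sigma_F @ W0)[1])
print(log_marg)                                              # Laplace approx to log Pr(Y | X)
```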