Gaussian process and Brownian motion


Lecture 1

Brownian motion

  • Brownian motion: limit of symmetric random walk (a simulation sketch appears at the end of this section)
    • taking smaller and smaller steps in smaller and smaller time intervals
      • each \(\Delta t\) time unit we take a step of size \(\Delta x\) either to the left or the right equal likely
      • let \(\Delta x=\sigma\sqrt{\Delta t}\)
      • \(X(t)=\Delta x(X_1+\ldots+X_{[t/\Delta t]})\): independent \(\{X_i=1,-1 \mbox{ with prob } 1/2\}\)
        • \(Var[X(t)]=\sigma^2 t\), \(E[X(t)]=0\)
      • let \(\Delta t\to 0\)
        • X(t) is normal with mean 0 and variance \(\sigma^2 t\) (CLT)
        • \(\{X(t),t\geq 0\}\) have independent and stationary increments (random walk property)
  • A stochastic process \(\{X(t),t\geq 0\}\) is said to be a Brownian motion (Wiener) process if
    • X(0) = 0;
    • \(\{X(t),t\geq 0\}\) has stationary and independent increments;
    • for every t > 0, X(t) is normally distributed with mean 0 and variance \(\sigma^2t\).
      • assuming \(\sigma=1\), the process is called standard Brownian motion
      • density of \(X(t)\), \(f_t(x)=e^{-x^2/2t}/\sqrt{2\pi t}\)
  • Joint density of \(X(t_1),X(t_2),\ldots,X(t_n)\), \(t_1<\ldots<t_n\)
    • independent increments -> \(X(t_1),X(t_2)−X(t_1),\ldots,X(t_n)−X(t_{n−1})\) are independent
    • stationary increments -> \(X(t_k)−X(t_{k-1}) \sim N(0,t_k-t_{k-1})\)
    • so \(f(x_1,\ldots,x_n)=f_{t_1}(x_1)\prod_{i=2}^nf_{t_i-t_{i-1}}(x_i-x_{i-1})\)
  • conditional distribution of X(s) given that X(t)=B where s < t
    • \(f_{s|t}(x|B)=\frac{f_s(x)f_{t-s}(B-x)}{f_t(B)}\)
    • a normal dist with mean \(Bs/t\), and variance \((t-s)s/t\)
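
A quick numerical check of the random-walk construction above (a minimal sketch in Python/NumPy; the values of \(\sigma\), \(t\), \(\Delta t\), and the number of paths are illustrative choices): with \(\Delta x=\sigma\sqrt{\Delta t}\), the simulated \(X(t)\) should have mean near 0 and variance near \(\sigma^2 t\).

```python
import numpy as np

rng = np.random.default_rng(0)

sigma, t, dt = 1.5, 2.0, 1e-2        # illustrative values
n_steps, n_paths = int(t / dt), 50_000
dx = sigma * np.sqrt(dt)             # step size Delta x = sigma * sqrt(Delta t)

# symmetric random walk: steps of +-dx, each with probability 1/2
steps = rng.choice([-dx, dx], size=(n_paths, n_steps))
X_t = steps.sum(axis=1)              # X(t) = dx * (X_1 + ... + X_[t/dt]) for each path

print("sample mean:", X_t.mean())    # ~ 0
print("sample var :", X_t.var())     # ~ sigma^2 * t = 4.5
```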

Hitting Times, Maximum Variable, and the Gambler's Ruin Problem

  • \(T_a\): the first time the Brownian motion process hits \(a\)
    • When \(a>0\), consider \(\Pr[X(t)\geq a]\)
      • \(\Pr[X(t)\geq a]=\Pr[X(t)\geq a|T_a\leq t]\Pr(T_a\leq t)+\Pr[X(t)\geq a|T_a>t]\Pr(T_a>t)\)
        • \(\Pr[X(t)\geq a|T_a\leq t]=1/2\) (symmetry)
        • \(\Pr[X(t)\geq a|T_a>t]=0\) (continuity)
      • \(=\Pr(T_a\leq t)/2\)
      • So \(\Pr(T_a\leq t)=2\Pr[X(t)\geq a]=2[1-\Phi(a/\sqrt{t})]\)
    • When \(a<0\), by symmetry, \(T_a\) has the same dist as \(T_{-a}\). Hence
      • \(\Pr(T_a\leq t)=2\Pr[X(t)\geq |a|]=2\Phi(-|a|/\sqrt{t}), \forall a\)
  • the maximum value the process attains in [0,t]
    • \(\Pr[\max_{s\in[0,t]}X(s)\geq a]=\Pr(T_a\leq t)=2\Phi(-a/\sqrt{t}), a>0\).
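
The hitting-time formula can be checked by simulation (a sketch; the grid size dt is an assumption, and the discrete-time maximum slightly underestimates the true running maximum):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

a, t, dt = 1.0, 1.0, 1e-3
n_steps, n_paths = int(t / dt), 10_000

# standard Brownian motion: independent N(0, dt) increments
increments = rng.normal(0.0, np.sqrt(dt), size=(n_paths, n_steps))
paths = np.cumsum(increments, axis=1)

# T_a <= t  iff  max_{s in [0,t]} X(s) >= a
hit_freq = (paths.max(axis=1) >= a).mean()

print("simulated Pr(T_a <= t)  :", hit_freq)
print("theory 2*Phi(-a/sqrt(t)):", 2 * norm.cdf(-a / np.sqrt(t)))
```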

Lecture 2

White Noise Transformation

  • Let \(f\) be a function having a continuous derivative on the region [a, b].
  • Let \(\{X(t),t\geq 0\}\) denote a standard Brownian motion process.
  • Define \(\int_a^b f(t)dX(t)=\lim_{n\to\infty,\max(t_i-t_{i-1})\to 0} \sum_{i=1}^n f(t_{i-1})[X(t_i)-X(t_{i-1})]\)
    • \(=f(b)X(b)-f(a)X(a)-\int_a^bX(t)df(t)\) (integration by parts).
    • \(E[\int_a^b f(t)dX(t)]=0\)
    • \(Var[\int_a^b f(t)dX(t)]=\int_a^bf^2(t)dt\)
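
The two moments of \(\int_a^b f(t)dX(t)\) can also be verified directly from the defining Riemann sum (a sketch; the choice \(f(t)=t^2\) on [0,1] is only for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)

dt = 1e-3
grid = np.arange(0.0, 1.0, dt)       # left endpoints t_{i-1} on [a, b] = [0, 1]
f = lambda t: t**2                   # illustrative integrand

n_paths = 10_000
dX = rng.normal(0.0, np.sqrt(dt), size=(n_paths, grid.size))    # X(t_i) - X(t_{i-1})
integral = (f(grid) * dX).sum(axis=1)    # sum_i f(t_{i-1}) [X(t_i) - X(t_{i-1})]

print("sample mean:", integral.mean())            # ~ 0
print("sample var :", integral.var())             # ~ int_0^1 t^4 dt = 0.2
print("theory     :", (f(grid) ** 2).sum() * dt)  # numerical int_a^b f^2(t) dt
```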

Gaussian Processes

  • A stochastic process X(t), \(t\geq 0\) is called a Gaussian, or a normal, process if \(X(t_1),\ldots,X(t_n)\) has a multivariate normal distribution for all \(t_1,\ldots,t_n\).
  • The Brownian motion process \(\{X(t),t\geq 0\}\) is a Gaussian process.
    • For the Brownian motion process, each of \(X(t_1),\ldots,X(t_n)\) can be expressed as a linear combination of the independent normal random variables \(X(t_1),X(t_2)−X(t_1),\ldots,X(t_n)−X(t_{n−1})\),
    • it is then a Gaussian process.
    • \(E[X(t)]=0, Var[X(t)]=t\)
    • \(\forall s\lt t, Cov[X(s),X(t)]=Cov[X(s),X(s)]+Cov[X(s),X(t)−X(s)]=s\) (independent increments)
  • Brownian bridge (conditional Brownian motion)
    • Given a standard Brownian process \(\{X(t),t\geq 0\}\)
    • \(\{X(t), 0\leq t\leq 1|X(1)=0\}\) is a Brownian bridge, also a Gaussian process
      • Conditional multivariate normal dist is still multivariate normal
      • \(E[X(s)|X(1)=0]=0, \forall s<1\)
      • \(Cov[(X(s),X(t))|X(1)=0]\)
        • \(=E[X(s)X(t)|X(1)=0]=E[E[X(s)X(t)|X(t),X(1)=0]|X(1)=0]=E[X(t)E[X(s)|X(t)]|X(1)=0]\)
        • \(=E[X(t)(s/t)X(t)|X(1)=0]=(s/t)E[X^2(t)|X(1)=0]\)
        • \(=t(1-t)s/t=s(1-t)\)
  • Proposition 10.1 If \(\{X(t),t\geq 0\}\) is standard Brownian motion, then \(Z(t)=X(t)-tX(1), 0\leq t\leq 1\) is a Brownian bridge process.
    • Z(t) is a Gaussian process
    • easy to check \(E[Z(t)]=0\) and \(Cov[Z(s),Z(t)]=s(1−t)\), when \(s\leq t\)
  • integrated Brownian motion
    • Given a Brownian motion \(\{X(t),t\geq 0\}\)
    • \(Z(t)=\int_0^t X(s)ds\) is called an integrated Brownian motion
      • \(E[Z(t)]=0\) (since E[X(s)]=0)
      • \(Cov[Z(s),Z(t)]=s^2(t/2-s/6)\) for \(s\leq t\).
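
Both covariance formulas (Brownian bridge via Proposition 10.1, and integrated Brownian motion) can be verified by Monte Carlo; in the sketch below the grid size, number of paths, and the points s=0.3, t=0.7 are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

dt, n_paths = 1e-3, 5_000
grid = np.arange(dt, 1.0 + dt / 2, dt)        # time grid on (0, 1]

incr = rng.normal(0.0, np.sqrt(dt), size=(n_paths, grid.size))
X = np.cumsum(incr, axis=1)                   # standard Brownian motion paths

s, t = 0.3, 0.7
i_s, i_t = int(round(s / dt)) - 1, int(round(t / dt)) - 1

# Brownian bridge (Proposition 10.1): Z(t) = X(t) - t X(1)
Z = X - grid * X[:, [-1]]
print("bridge cov:", np.cov(Z[:, i_s], Z[:, i_t])[0, 1])   # ~ s(1-t) = 0.09

# integrated Brownian motion: Z(t) = int_0^t X(u) du
I = np.cumsum(X, axis=1) * dt
print("int BM cov:", np.cov(I[:, i_s], I[:, i_t])[0, 1])   # ~ s^2(t/2 - s/6) = 0.027
```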

Lecture 3

Stationary and Weakly Stationary Processes

  • A stochastic process \(\{X(t),t\geq 0\}\) is said to be a stationary process if \(\forall n,s,t_1,\ldots,t_n\), \(X(t_1),\ldots,X(t_n)\) and \(X(t_1+s),\ldots,X(t_n+s)\) have the same joint distribution.
    • a process is stationary if, in choosing any fixed point s as the origin, the ensuing process has the same probability law.
    • Example: an ergodic continuous-time Markov chain \(\{X(t),t\geq 0\}\) with \(\Pr[X(0)=j]=P_j,j\geq 0\), where the \(P_j\) are the limiting probabilities, is a stationary process.
      • a Markov chain whose initial state is chosen according to the limiting probabilities.
      • can be regarded as an ergodic Markov chain that we start observing at time \(\infty\). Hence, the continuation of this process at time s after observation begins is just the continuation of the chain starting at time \(\infty+s\), which clearly has the same probability law for all s.
    • Example: \(X(t)=N(t+L)−N(t), t\geq 0\), where L>0 is a fixed constant and \(\{N(t),t\geq 0\}\) is a Poisson process having rate \(\lambda\), is also a stationary process (see the sketch below).
      • X(t) represents the number of events of a Poisson process that occur between t and t + L
      • Poisson process has stationary and independent increments
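
For the Poisson-increment example, stationarity is easy to check empirically: the covariance of X(t) and X(t+s) should not depend on t (a sketch; \(\lambda\), L, s, and the grid approximation of the Poisson process are all illustrative assumptions).

```python
import numpy as np

rng = np.random.default_rng(4)

lam, L, s = 2.0, 1.0, 0.4
n_paths, horizon, dt = 5_000, 5.0, 1e-2

# approximate a rate-lam Poisson process on a fine grid by Bernoulli(lam*dt) events
events = rng.random((n_paths, int(horizon / dt))) < lam * dt
N = np.cumsum(events, axis=1)                 # N(t) at grid times dt, 2*dt, ...

def X(t):
    """X(t) = N(t+L) - N(t): number of events in (t, t+L]."""
    i, j = int(round(t / dt)) - 1, int(round((t + L) / dt)) - 1
    return N[:, j] - N[:, i]

for t0 in (0.5, 1.5, 3.0):    # covariance should be roughly the same for each t0
    c = np.cov(X(t0), X(t0 + s))[0, 1]
    print(f"t = {t0}: Cov[X(t), X(t+s)] ~ {c:.3f}")
```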
  • \(\{X(t),t\geq 0\}\) is second-order stationary or weakly stationary
    • if \(E[X(t)]=c\) and \(Cov[X(t),X(t+s)]\) does not depend on t
      • the first two moments of X(t) are the same for all t and the covariance between X(s) and X(t) depends only on |t−s|
      • As the finite dimensional distributions of a Gaussian process (being multivariate normal) are determined by their means and covariance, it follows that a second-order stationary Gaussian process is stationary.
    • Proposition 10.2 Let \(\{X_n,n\geq 1\}\) be a second-order stationary process with mean \(\mu\) and covariance function \(R(i)=Cov(X_n,X_{n+i})\), and let \(\bar{X}_n=\sum_{i=1}^nX_i/n\)
      • \(\lim_{n\to\infty} E[(\bar{X}_n-\mu)^2]=0\) iff \(\lim_{n\to\infty}\sum_{i=1}^n R(i)/n=0\).
  • autoregressive process (Ex 10.7, Page 657)
    • Let \(\{Z_n\}\) be uncorrelated rvs with \(E[Z_n]=0, n\geq 0\)
      • \(Var(Z_0)=\sigma^2/(1-\lambda^2), Var(Z_n)=\sigma^2, n\geq 1, \lambda^2<1\)
    • Define \(X_0=Z_0, X_n=\lambda X_{n-1}+Z_n, n\geq 1\)
      • \(\{X_n\}\) is a first-order autoregressive process
      • \(X_n=\sum_{i=0}^n\lambda^{n-i}Z_i\)
      • \(E(X_n)=0, Cov(X_n,X_{n+m})=\sigma^2\lambda^m/(1-\lambda^2)\)
      • \(\{X_n,n\geq 0\}\) is weakly stationary
  • moving average process (Ex 10.9, Page 658)
    • Let \(\{W_n\}\) be uncorrelated with \(E[W_n]=\mu,Var[W_n]=\sigma^2,n\geq 0\)
    • Define \(X_n=\sum_{i=0}^kW_{n-i}/(k+1), k>0\)
      • \(\{X_n,n\geq k\}\) is a moving average process
        • arithmetic average of the most recent k+1 values of the Ws
        • \(Cov(X_n,X_{n+m})=(k+1-m)\sigma^2/(k+1)^2\) for \(0\leq m\leq k\), and \(=0\) for \(m>k\).
      • \(X_n\) is a second-order stationary process.
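
The autoregressive and moving-average covariance formulas above can both be checked by simulating many independent series (a sketch; \(\lambda\), \(\sigma\), k, and the chosen lags are illustrative).

```python
import numpy as np

rng = np.random.default_rng(5)
n_series, n, sigma, lam = 20_000, 200, 1.0, 0.6

# first-order autoregressive process: X_0 = Z_0, X_n = lam * X_{n-1} + Z_n
X = np.empty((n_series, n))
X[:, 0] = rng.normal(0.0, sigma / np.sqrt(1 - lam**2), n_series)   # Var(Z_0) = sigma^2/(1-lam^2)
for i in range(1, n):
    X[:, i] = lam * X[:, i - 1] + rng.normal(0.0, sigma, n_series)

m = 3
print("AR(1) cov:", np.cov(X[:, 100], X[:, 100 + m])[0, 1])
print("theory   :", sigma**2 * lam**m / (1 - lam**2))               # 0.3375

# moving average of the most recent k+1 W's
k, m = 4, 2
W = rng.normal(0.0, sigma, size=(n_series, n))
MA = np.stack([W[:, i - k:i + 1].mean(axis=1) for i in range(k, n)], axis=1)
print("MA cov   :", np.cov(MA[:, 50], MA[:, 50 + m])[0, 1])
print("theory   :", (k + 1 - m) * sigma**2 / (k + 1)**2)            # 0.12
```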

Lecture 4

Application of Gaussian process models to regression problems

  • First let's consider a linear regression problem (in a mixed-model or Bayesian setting); a numerical sketch of the weight-space and kernel forms appears at the end of this block
    • \(f(X)=X\beta,Y=f(X)+\epsilon, \epsilon\sim N(0,\sigma^2I_n)\)
      • data: X (nxp), Y (n observations); parameters: \(\beta\) (p regression parameters)
      • assume a zero-mean Gaussian prior on \(\beta\) with covariance matrix \(\Sigma_p\): \(\beta\sim N(0,\Sigma_p)\)
    • inference could be based on the posterior dist of \(\beta\)
      • \(\Pr(\beta|Y,X)=\frac{\Pr(\beta)\Pr(Y|\beta,X)}{\Pr(Y|X)}\)
        • normalizing constant/marginal likelihood: \(\Pr(Y|X)=\int\Pr(\beta)\Pr(Y|\beta,X)d\beta\)
        • \((Y|X)\sim N(0,X\Sigma_pX^T+\sigma^2I_n)\) (using \(E[Y],Var[Y]\))
      • easy to check \((\beta|Y,X)\sim N(\sigma^{-2}A^{-1}X^TY,A^{-1}), A=\sigma^{-2}X^TX+\Sigma_p^{-1}\)
        • note \(A^{-1}=(\sigma^{-2}X^TX+\Sigma_p^{-1})^{-1}=\Sigma_p-\Sigma_pX^T(X\Sigma_pX^T+\sigma^2I_n)^{-1}X\Sigma_p\)
    • for a new observation with \(X_1\), want to predict the response
      • \(\Pr[f(X_1)|X,Y]=\int \Pr(f(X_1)|X_1,\beta)\Pr(\beta|Y,X)d\beta\)
        • \(f(X_1)=X_1^T\beta\)
        • average over all possible parameter values, weighted by their posterior probability
        • \(\sim N(\sigma^{-2}X_1^TA^{-1}X^TY,X_1^TA^{-1}X_1)\)
    • extend linear model by projecting covariates \(X\) into high dimensional space
      • project \(X\) into some high-dimensional space using a set of basis functions and then apply the linear model in this space instead of directly on \(X\) itself.
      • assume \(\Phi(X)\): map \(p\)-dim \(X\) into M-dim covariates
      • Consider model \(f(X)=\Phi(X)\beta\), \(Y=f(X)+\epsilon\), \(\epsilon\sim N(0,\sigma^2I_n)\)
        • data: \(\Phi(X)\) matrix of (nxM)-dim; parameter: \(\beta\in R^M\)
        • prior \(\beta\sim N(0,\Sigma_M)\)
        • posterior \((\beta|Y,X)\sim N(\sigma^{-2}A^{-1}\Phi(X)^TY,A^{-1}), A=\sigma^{-2}\Phi(X)^T\Phi(X)+\Sigma_M^{-1}\)
        • given \(X_1\in R^p\), \(\Phi(X_1)\in R^M\). Consider \(f(X_1)=\Phi(X_1)^T\beta\). \([f(X_1)|X,Y]\) still follows a normal dist with
          • mean \(\Phi(X_1)^T\Sigma_M\Phi(X)^T(K+\sigma^2I_n)^{-1}Y\), \(K=\Phi(X)\Sigma_M\Phi(X)^T\)
          • variance \(\Phi(X_1)^T\Sigma_M\Phi(X_1)-\Phi(X_1)^T\Sigma_M\Phi(X)^T(K+\sigma^2I_n)^{-1}\Phi(X)\Sigma_M\Phi(X_1)\)
          • \(\Leftarrow \sigma^{-2}A^{-1}\Phi(X)^T=\Sigma_M\Phi(X)^T(K+\sigma^2I_n)^{-1}\) (multiply both sides by \(A\) to prove mean equivalence)
          • covariance matrix equivalence derived from the matrix inversion lemma (Woodbury, Sherman & Morrison formula)
            • \((Z+UWV^T)^{-1}=Z^{-1}-Z^{-1}U(W^{-1}+V^TZ^{-1}U)^{-1}V^TZ^{-1}\)
              • Z (nxn); W (mxm); U,V (nxm) (works well for \(m<n\))
              • similar results for determinant: \(|Z+UWV^T|=|Z||W||W^{-1}+V^TZ^{-1}U|\)
              • when \(U=V=I\), \((A^{-1}+B^{-1})^{-1}=A-A(A+B)^{-1}A=B-B(A+B)^{-1}B\)
            • taking \(Z=\Sigma_M^{-1},W=\sigma^{-2}I_n,V=U=\Phi(X)^T\), \(\Rightarrow\) variance equivalence
        • all posterior quantities are determined by \(\Phi(X_1)^T\Sigma_M\Phi(X)^T\), \(\Phi(X)\Sigma_M\Phi(X)^T\) or \(\Phi(X_1)^T\Sigma_M\Phi(X_1)\)
          • they are all quadratic functions and can be treated as some type of inner product measure (wrt \(\Sigma_M\))
          • define \(k(X,X_1)=\Phi(X)^T\Sigma_M\Phi(X_1)\) (covariance function or kernel)
            • Let \(\Sigma_M^{1/2}\) be the square root of \(\Sigma_M\), then \(k(X,X_1)=[\Sigma_M^{1/2}\Phi(X)]^T[\Sigma_M^{1/2}\Phi(X_1)]\)
          • when an algorithm is defined solely in terms of inner products in \(X\), it can be extended by replacing those inner products by \(k(X,X_1)\);
            • this is sometimes called the kernel trick.
            • it is particularly valuable in situations where it is more convenient to compute the kernel than the extended vectors themselves.
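
The weight-space posterior predictive (via \(A^{-1}\)) and the kernel form (via \(K+\sigma^2I_n\)) should give identical answers; the sketch below checks this agreement numerically on random data (the feature matrix, prior covariance, and noise level are illustrative assumptions).

```python
import numpy as np

rng = np.random.default_rng(6)
n, M, sigma2 = 50, 5, 0.25

Phi = rng.normal(size=(n, M))                       # Phi(X): n x M feature matrix
Sigma_M = np.diag(rng.uniform(0.5, 2.0, M))         # prior covariance of beta
beta = rng.multivariate_normal(np.zeros(M), Sigma_M)
Y = Phi @ beta + rng.normal(0.0, np.sqrt(sigma2), n)
phi1 = rng.normal(size=M)                           # Phi(X_1) for a new observation

# weight-space form: A = sigma^-2 Phi^T Phi + Sigma_M^-1
A = Phi.T @ Phi / sigma2 + np.linalg.inv(Sigma_M)
mean_w = phi1 @ np.linalg.solve(A, Phi.T @ Y) / sigma2
var_w = phi1 @ np.linalg.solve(A, phi1)

# function-space (kernel) form: K = Phi Sigma_M Phi^T, k1 = k(X, X_1)
K = Phi @ Sigma_M @ Phi.T
B = K + sigma2 * np.eye(n)
k1 = Phi @ Sigma_M @ phi1
mean_f = k1 @ np.linalg.solve(B, Y)
var_f = phi1 @ Sigma_M @ phi1 - k1 @ np.linalg.solve(B, k1)

print(np.allclose(mean_w, mean_f), np.allclose(var_w, var_f))      # True True
```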
  • We can treat a Gaussian process as a collection of random variables, any finite number of which have a joint Gaussian distribution
    • Consider a Gaussian Process (GP) \(f(x)\) with mean function \(m(x)\) and covariance function \(k(x,x_1)\). Here \(x\in R^p\) (x can be treated as time index).
      • \(E[f(x)]=m(x), Cov[f(x),f(x_1)]=E[(f(x)-m(x))(f(x_1)-m(x_1))]=k(x,x_1)\)
      • \(f(x)\sim GP(m(x),k(x,x_1))\)
      • random variables represent the value of the function \(f(x)\) at location \(x\)
    • Consider a linear reg model with normal prior: \(f(x)=\Phi(x)^T\beta, \beta\sim N(0,\Sigma_M)\)
      • \(E[f(x)]=\Phi(x)^TE(\beta)=0, Cov[f(x),f(x_1)]=\Phi(x)^TE[\beta\beta^T]\Phi(x_1)=\Phi(x)^T\Sigma_M\Phi(x_1)\)
        • \([f(x),f(x_1)]\) follow bivariate normal dist \(\forall x, x_1\).
      • \(\forall x_1,\ldots,x_n\), \([f(x_1),\ldots,f(x_n)]\) jointly follow multivariate normal dist (MVN).
        • could be a degenerate/singular dist when \(n>M\) (M is the number of basis, i.e., dim of \(\Phi(x)\))
        • \(f(x)\) can be treated as a Gaussian process with mean \(m(x)=0\) and covariance \(k(x,x_1)=\Phi(x)^T\Sigma_M\Phi(x_1)\)
    • Could be extended by using a more general covariance function k(,)
      • for every positive definite covariance function k(,), there exists a (possibly infinite) expansion of basis functions
        • e.g., \(k(x,x_1)=\exp(-\|x-x_1\|^2/2)\) (squared exponential kernel)
      • for any set of basis functions, we can compute the corresponding covariance function as \(k(x,x_1)=\Phi(x)^T\Sigma\Phi(x_1)\)
      • k(,) implies a distribution over functions f(x).
        • at any number of points f(x) is a MVN dist with the corresponding covariance matrix k(,)
    • Consider GP \(f(x)\) with mean 0 and covariance function \(k(,)\)
      • observed data \(X\) (nxp) and some new data \(X_1\) (mxp), jointly \([f(X),f(X_1)]\) has MVN dist
        • mean 0 and covariance \(Var[f(X)]=k(X,X),Var[f(X_1)]=k(X_1,X_1), Cov[f(X),f(X_1)]=k(X,X_1)\)
        • \(f(X)\in R^n, f(X_1)\in R^m\), \(k(X,X)\) (nxn), \(k(X,X_1)\) (nxm)
      • conditionally \([f(X_1)|f(X),X,X_1]\) has MVN dist
        • mean \(k(X_1,X)k(X,X)^{-1}f(X)\), covariance \(k(X_1,X_1)-k(X_1,X)k(X,X)^{-1}k(X,X_1)\)
      • Assume \(y=f(x)+\epsilon, \epsilon\sim N(0,\sigma^2)\)
        • assume independent \(\epsilon\) for different \(x\)
        • \(y\) is also a GP with mean 0 and covariance function \(K(x,x)=k(x,x)+\sigma^2\), \(K(x,x_1)=k(x,x_1), \forall x\neq x_1\)
        • Given observed \(Y\) (and \(X\)), for new \(X_{\ast}\)
          • \([Y,f(X_{\ast})]\) has MVN dist with mean 0 and covariance: \(Var[Y]=k(X,X)+\sigma^2I_n, Cov[Y,f(X_{\ast})]=k(X,X_{\ast}), Var[f(X_{\ast})]=k(X_{\ast},X_{\ast})\)
          • \(Y\sim N(0,k(X,X)+\sigma^2I_n)\)
          • conditionally \([f(X_{\ast})|Y]\) is a MVN with
            • mean \(k(X_{\ast},X)[k(X,X)+\sigma^2I_n]^{-1}Y\) (linear function of Y)
            • variance \(k(X_{\ast},X_{\ast})-k(X_{\ast},X)[k(X,X)+\sigma^2I_n]^{-1}k(X,X_{\ast})\)
          • for single obs \(X_{\ast}\), conditional mean is \(\sum_{i=1}^n\alpha_ik(X_{\ast},X_i), \alpha=[k(X,X)+\sigma^2I_n]^{-1}Y\)
            • a linear combination of n kernel functions
    • GP with fixed mean function
      • \(f(x)\sim GP(m(x),k(x,x))\)
        • \(m(x)\) given fixed function
      • conditional mean: \(m(X_{\ast})+k(X_{\ast},X)[k(X,X)+\sigma^2I_n]^{-1}[Y-m(X)]\)
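
Putting the predictive equations together, below is a minimal GP regression sketch with the squared exponential kernel; the training data (a noisy sine), noise level, and test grid are illustrative assumptions, not part of the lecture.

```python
import numpy as np

rng = np.random.default_rng(7)

def se_kernel(A, B, length=1.0):
    """Squared exponential kernel k(x, x') = exp(-||x - x'||^2 / (2 length^2))."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / length**2)

# training data: noisy observations of a smooth function (illustrative)
X = np.linspace(0, 5, 20)[:, None]
Y = np.sin(X[:, 0]) + rng.normal(0.0, 0.1, X.shape[0])
sigma2 = 0.1**2

Xs = np.linspace(0, 5, 100)[:, None]          # test points X_*

K = se_kernel(X, X)
Ks = se_kernel(Xs, X)                          # k(X_*, X)
Kss = se_kernel(Xs, Xs)                        # k(X_*, X_*)

# predictive distribution of f(X_*) given Y, using the mean/variance formulas above
alpha = np.linalg.solve(K + sigma2 * np.eye(len(X)), Y)
mean = Ks @ alpha                              # sum_i alpha_i k(X_*, X_i)
cov = Kss - Ks @ np.linalg.solve(K + sigma2 * np.eye(len(X)), Ks.T)
sd = np.sqrt(np.clip(np.diag(cov), 0.0, None))

print("max |pred mean - sin(x)|:", np.abs(mean - np.sin(Xs[:, 0])).max())
print("predictive sd, min/max  :", sd.min(), sd.max())
```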

Lecture 5

Application of Gaussian process models to classification with logistic regression

  • Consider a linear logistic regression model with Gaussian prior
    • \(n\) observations: \(\mbox{logit}[\Pr(y_i=1|x_i,\beta)]=x_i^T\beta\), \(i=1,\ldots,n\), \(\beta\sim N(0,\Sigma)\), \(x_i\) (mx1), \(Y=(y_1,\ldots,y_n)^T\)
      • assume a conditionally independent logit model
      • marginal prob \(\Pr(y_i|x_i)=\int\Pr(y_i|x_i,\beta)\phi(\beta,\Sigma)d\beta\), where \(\phi(\beta,\Sigma)=(2\pi)^{-m/2}|\Sigma|^{-1/2}\exp(-\beta^T\Sigma^{-1}\beta/2)\)
      • joint marginal prob \(\Pr(Y|X)=\int[\prod_{i=1}^n\Pr(y_i|x_i,\beta)]\phi(\beta,\Sigma)d\beta\)
      • conditional/posterior prob \(\Pr(\beta|Y,X)=[\prod_{i=1}^n\Pr(y_i|x_i,\beta)]\phi(\beta,\Sigma)/\Pr(Y|X)\)
    • For \(K\) new observations with covariates \(X_1\) (Kxm)
      • classification incorporating training obs: \(\Pr(Y_1|X_1,Y,X)=\int\Pr(Y_1|X_1,\beta)\Pr(\beta|Y,X)d\beta=\frac{\int\Pr(Y_1|X_1,\beta)\Pr(Y|X,\beta)\phi(\beta,\Sigma)d\beta}{\int\Pr(Y|X,\beta)\phi(\beta,\Sigma)d\beta}\)
        • involving two multi-dim integrations with gaussian kernels
  • Extension to logistic regression model with Gaussian Process prior
    • \(\mbox{logit}[\Pr(y_i=1|x_i,f_i)]=f_i\), \(i=1,\ldots,n\), \(F=(f_1,\ldots,f_n)^T\sim N(0,\Sigma_F)\), \(Y=(y_1,\ldots,y_n)^T\), \(X\) (nxm)
      • conditional independence: the dist of \(y_i\) is completely determined by \(f_i\); given \(F\), the \(y_i\) are independent
        • can incorporate additional independent errors to \(y_i|f_i\) (conditional independence still holds)
      • GP prior: \(Cov(f_i,f_j)=K(x_i,x_j), E(f_i)=0\)
      • joint marginal prob \(\Pr(Y|X)=\int[\prod_{i=1}^n\Pr(y_i|x_i,f_i)]\phi(F,\Sigma_F)dF\) (integration over n-dim GP F)
    • For \(K\) new observations with covariates \(X_1\)
      • \(\mbox{logit}[\Pr(Y_1=1|X_1,F_1)]=F_1\), \(F_1\sim N(0,\Sigma_1), Cov(f_{1i},f_{1j})=K(x_{1i},x_{1j})\)
      • GP prior: \((F,F_1)\sim N(0,\tilde{\Sigma}), Cov(f_i,f_{1j})=K(x_i,x_{1j})\) ((n+K)-dim normal dist)
        • \(F_1|F\sim N(A\Sigma_F^{-1}F,\Sigma_1-A\Sigma_F^{-1}A^T), A=Cov(F_1,F)=K(X_1,X)\)
      • \(\Pr(Y_1|X_1,Y,X)=\frac{\Pr(Y_1,Y|X,X_1)}{\Pr(Y|X)}=\frac{\int\Pr(Y_1,Y|X,X_1,F,F_1)\phi[(F,F_1),\tilde{\Sigma}]d(F,F_1)}{\int\Pr(Y|X,F)\phi(F,\Sigma_F)dF}\)
        • two multi-dim integrations over GP priors
  • Consider logistic regression with GP prior
    • multi-variate normal prior \(F\sim N(0,\Sigma_F)\)
      • covariance matrix is computed based on covariance/kernel function, \(\Sigma_F=K(X,X)\)
    • conditionally independent logistic regression model
      • \(\theta_i=\Pr(y_i=1|f_i)=\frac{1}{1+\exp(-f_i)}, i=1,\ldots,n\)
    • In summary, the logistic log likelihood can be written as \(\ell=\log[\Pr(y|f)]=-\log[1+\exp(-(2y-1)f)]=yf-\log[1+\exp(f)]\), \(y\in\{0,1\}\)
      • \(\frac{\partial\ell}{\partial f}=y-\theta, \frac{\partial^2\ell}{\partial f^2}=-\theta(1-\theta)\), where \(\theta=\frac{1}{1+\exp(-f)}\)
      • quadratic expansion at \(f_0\): \(\ell\approx \ell(f_0)+(y-\theta_0)(f-f_0)-\frac{1}{2}\theta_0(1-\theta_0)(f-f_0)^2\)
    • Laplace approx to high-dim integration
      • solve \(F_0\): \(\max_F \sum_{i=1}^n\log\Pr(y_i|f_i) - \frac{1}{2}F^T\Sigma_F^{-1}F\) (i.e., maximizing the joint or conditional likelihood)
        • logistic regression with ridge penalty, solved using, e.g., NR (or IWLS): \(F\Leftarrow F+(\Sigma_F^{-1}+W)^{-1}(Y-\theta-\Sigma_F^{-1}F)\), \(W=\mbox{diag}\{\theta(1-\theta)\}\)
          • NR can be equivalently written as \(F\Leftarrow (\Sigma_F^{-1}+W)^{-1}(Y-\theta+WF)\)
        • note \((\Sigma_F^{-1}+W)^{-1}=\Sigma_F-\Sigma_FW^{1/2}(I+W^{1/2}\Sigma_FW^{1/2})^{-1}W^{1/2}\Sigma_F=\Sigma_F-\Sigma_F(W^{-1}+\Sigma_F)^{-1}\Sigma_F\)
      • \(\Pr(Y|X)=\int\Pr(Y|X,F)\Pr(F)dF\approx\exp[\ell(F_0)-F_0^T\Sigma_F^{-1}F_0/2]|I+\Sigma_FW_0|^{-1/2}\)
    • In principle, additional fixed-effects covariates can be incorporated; estimation then alternates between updating those parameters and the approximate Laplace integration.
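
A sketch of the Newton/IWLS iteration and Laplace marginal likelihood described above, for GP logistic regression (the synthetic data, squared exponential kernel, jitter, and iteration count are all illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(8)

def se_kernel(A, B, length=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / length**2)

# synthetic binary data whose log-odds vary smoothly with x (illustrative)
n = 80
X = np.sort(rng.uniform(-3, 3, n))[:, None]
true_f = 2.0 * np.sin(X[:, 0])
Y = (rng.random(n) < 1 / (1 + np.exp(-true_f))).astype(float)

Sigma_F = se_kernel(X, X) + 1e-6 * np.eye(n)     # GP prior covariance K(X, X), with jitter
Sigma_inv = np.linalg.inv(Sigma_F)

# Newton/IWLS iterations for F_0 = argmax sum_i log Pr(y_i|f_i) - F^T Sigma_F^{-1} F / 2,
# using the update F <- (Sigma_F^{-1} + W)^{-1} (Y - theta + W F)
F = np.zeros(n)
for _ in range(25):
    theta = 1 / (1 + np.exp(-F))
    W = theta * (1 - theta)
    F = np.linalg.solve(Sigma_inv + np.diag(W), Y - theta + W * F)

# Laplace approximation to the marginal likelihood Pr(Y|X)
theta = 1 / (1 + np.exp(-F))
W = theta * (1 - theta)
ell = np.sum(Y * F - np.log1p(np.exp(F)))        # logistic log likelihood at F_0
log_pY = ell - 0.5 * F @ Sigma_inv @ F - 0.5 * np.linalg.slogdet(np.eye(n) + Sigma_F * W)[1]
print("Laplace log marginal likelihood:", log_pY)
```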