Gaussian process and Brownian motion


Lecture 1

Brownian motion

  • Brownian motion: limit of symmetric random walk (a simulation sketch appears at the end of this section)
    • taking smaller and smaller steps in smaller and smaller time intervals
      • each \(\Delta t\) time unit we take a step of size \(\Delta x\) either to the left or the right equal likely
      • let \(\Delta x=\sigma\sqrt{\Delta t}\)
      • \(X(t)=\Delta x(X_1+\ldots+X_{[t/\Delta t]})\): independent \(\{X_i=1,-1 \mbox{ with prob } 1/2\}\)
        • \(Var[X(t)]=\sigma^2 t\), \(E[X(t)]=0\)
      • let \(\Delta t\to 0\)
        • X(t) is normal with mean 0 and variance \(\sigma^2 t\) (CLT)
        • \(\{X(t),t\geq 0\}\) have independent and stationary increments (random walk property)
  • A stochastic process \(\{X(t),t\geq 0\}\) is said to be a Brownian motion (Wiener) process if
    • X(0) = 0;
    • \(\{X(t),t\geq 0\}\) has stationary and independent increments;
    • for every t > 0, X(t) is normally distributed with mean 0 and variance \(\sigma^2t\).
      • assuming \(\sigma=1\), the process is called standard Brownian motion
      • density of \(X(t)\), \(f_t(x)=e^{-x^2/2t}/\sqrt{2\pi t}\)
  • Joint density of \(X(t_1),X(t_2),\ldots,X(t_n)\), \(t_1<\ldots<t_n\)
    • independent increments -> \(X(t_1),X(t_2)−X(t_1),\ldots,X(t_n)−X(t_{n−1})\) are independent
    • stationary increments -> \(X(t_k)−X(t_{k-1}) \sim N(0,t_k-t_{k-1})\)
    • so \(f(x_1,\ldots,x_n)=f_{t_1}(x_1)\prod_{i=2}^nf_{t_i-t_{i-1}}(x_i-x_{i-1})\)
  • conditional distribution of X(s) given that X(t)=B where s < t
    • \(f_{s|t}(x|B)=\frac{f_s(x)f_{t-s}(B-x)}{f_t(B)}\)
    • a normal dist with mean \(Bs/t\), and variance \((t-s)s/t\)
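
A quick numerical check of the random-walk construction above (a minimal sketch in Python/NumPy; the values of \(\sigma\), \(t\), \(\Delta t\), and the number of paths are illustrative choices): with \(\Delta x=\sigma\sqrt{\Delta t}\), the simulated \(X(t)\) should have mean near 0 and variance near \(\sigma^2 t\).

```python
import numpy as np

rng = np.random.default_rng(0)

sigma, t, dt = 1.5, 2.0, 1e-2        # illustrative values
n_steps, n_paths = int(t / dt), 50_000
dx = sigma * np.sqrt(dt)             # step size Delta x = sigma * sqrt(Delta t)

# symmetric random walk: steps of +-dx, each with probability 1/2
steps = rng.choice([-dx, dx], size=(n_paths, n_steps))
X_t = steps.sum(axis=1)              # X(t) = dx * (X_1 + ... + X_[t/dt]) for each path

print("sample mean:", X_t.mean())    # ~ 0
print("sample var :", X_t.var())     # ~ sigma^2 * t = 4.5
```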

Hitting Times, Maximum Variable, and the Gambler's Ruin Problem

  • \(T_a\): the first time the Brownian motion process hits \(a\)
    • When \(a>0\), consider \(\Pr[X(t)\geq a]\)
      • \(\Pr[X(t)\geq a]=\Pr[X(t)\geq a|T_a\leq t]\Pr(T_a\leq t)+\Pr[X(t)\geq a|T_a>t]\Pr(T_a>t)\)
        • \(\Pr[X(t)\geq a|T_a\leq t]=1/2\) (symmetry)
        • \(\Pr[X(t)\geq a|T_a>t]=0\) (continuity)
      • \(=\Pr(T_a\leq t)/2\)
      • So \(\Pr(T_a\leq t)=2\Pr[X(t)\geq a]=2[1-\Phi(a/\sqrt{t})]\)
    • When \(a<0\), by symmetry, \(T_a\) has the same dist as \(T_{-a}\). Hence
      • \(\Pr(T_a\leq t)=2\Pr[X(t)\geq |a|]=2\Phi(-|a|/\sqrt{t}), \forall a\)
  • the maximum value the process attains in [0,t]
    • \(\Pr[\max_{s\in[0,t]}X(s)\geq a]=\Pr(T_a\leq t)=2\Phi(-a/\sqrt{t}), a>0\).
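
The hitting-time formula can be checked by simulation (a sketch; the grid size dt is an assumption, and the discrete-time maximum slightly underestimates the true running maximum):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

a, t, dt = 1.0, 1.0, 1e-3
n_steps, n_paths = int(t / dt), 10_000

# standard Brownian motion: independent N(0, dt) increments
increments = rng.normal(0.0, np.sqrt(dt), size=(n_paths, n_steps))
paths = np.cumsum(increments, axis=1)

# T_a <= t  iff  max_{s in [0,t]} X(s) >= a
hit_freq = (paths.max(axis=1) >= a).mean()

print("simulated Pr(T_a <= t)  :", hit_freq)
print("theory 2*Phi(-a/sqrt(t)):", 2 * norm.cdf(-a / np.sqrt(t)))
```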

Lecture 2

White Noise Transformation

  • Let \(f\) be a function having a continuous derivative on the region [a, b].
  • Let \(\{X(t),t\geq 0\}\) denote a standard Brownian motion process.
  • Define \(\int_a^b f(t)dX(t)=\lim_{n\to\infty,\max(t_i-t_{i-1})\to 0} \sum_{i=1}^n f(t_{i-1})[X(t_i)-X(t_{i-1})]\)
    • \(=f(b)X(b)-f(a)X(a)-\int_a^bX(t)df(t)\) (integration by parts).
    • \(E[\int_a^b f(t)dX(t)]=0\)
    • \(Var[\int_a^b f(t)dX(t)]=\int_a^bf^2(t)dt\)
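
The two moments of \(\int_a^b f(t)dX(t)\) can also be verified directly from the defining Riemann sum (a sketch; the choice \(f(t)=t^2\) on [0,1] is only for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)

dt = 1e-3
grid = np.arange(0.0, 1.0, dt)       # left endpoints t_{i-1} on [a, b] = [0, 1]
f = lambda t: t**2                   # illustrative integrand

n_paths = 10_000
dX = rng.normal(0.0, np.sqrt(dt), size=(n_paths, grid.size))    # X(t_i) - X(t_{i-1})
integral = (f(grid) * dX).sum(axis=1)    # sum_i f(t_{i-1}) [X(t_i) - X(t_{i-1})]

print("sample mean:", integral.mean())            # ~ 0
print("sample var :", integral.var())             # ~ int_0^1 t^4 dt = 0.2
print("theory     :", (f(grid) ** 2).sum() * dt)  # numerical int_a^b f^2(t) dt
```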

Gaussian Processes

  • A stochastic process X(t), \(t\geq 0\) is called a Gaussian, or a normal, process if \(X(t_1),\ldots,X(t_n)\) has a multivariate normal distribution for all \(t_1,\ldots,t_n\).
  • The Brownian motion process \(\{X(t),t\geq 0\}\) is a Gaussian process.
    • For the Brownian motion process, each of \(X(t_1),\ldots,X(t_n)\) can be expressed as a linear combination of the independent normal random variables \(X(t_1),X(t_2)−X(t_1),\ldots,X(t_n)−X(t_{n−1})\),
    • it is then a Gaussian process.
    • \(E[X(t)]=0, Var[X(t)]=t\)
    • \(\forall s\lt t, Cov[X(s),X(t)]=Cov[X(s),X(s)]+Cov[X(s),X(t)−X(s)]=s\) (independent increments)
  • Brownian bridge (conditional Brownian motion)
    • Given a standard Brownian process \(\{X(t),t\geq 0\}\)
    • \(\{X(t), 0\leq t\leq 1|X(1)=0\}\) is a Brownian bridge, also a Gaussian process
      • Conditional multivariate normal dist is still multivariate normal
      • \(E[X(s)|X(1)=0]=0, \forall s<1\)
      • \(Cov[(X(s),X(t))|X(1)=0]\)
        • \(=E[X(s)X(t)|X(1)=0]=E[E[X(s)X(t)|X(t),X(1)=0]|X(1)=0]=E[X(t)E[X(s)|X(t)]|X(1)=0]\)
        • \(=E[X(t)(s/t)X(t)|X(1)=0]=(s/t)E[X^2(t)|X(1)=0]\)
        • \(=t(1-t)s/t=s(1-t)\)
  • Proposition 10.1 If \(\{X(t),t\geq 0\}\) is standard Brownian motion, then \(Z(t)=X(t)-tX(1), 0\leq t\leq 1\) is a Brownian bridge process.
    • Z(t) is a Gaussian process
    • easy to check \(E[Z(t)]=0\) and \(Cov[Z(s),Z(t)]=s(1−t)\), when \(s\leq t\)
  • integrated Brownian motion
    • Given a Brownian motion \(\{X(t),t\geq 0\}\)
    • \(Z(t)=\int_0^t X(s)ds\) is called an integrated Brownian motion
      • \(E[Z(t)]=0\) (since E[X(s)]=0)
      • \(Cov[Z(s),Z(t)]=s^2(t/2-s/6)\) for \(s\leq t\).
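
Both covariance formulas (Brownian bridge via Proposition 10.1, and integrated Brownian motion) can be verified by Monte Carlo; in the sketch below the grid size, number of paths, and the points s=0.3, t=0.7 are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

dt, n_paths = 1e-3, 5_000
grid = np.arange(dt, 1.0 + dt / 2, dt)        # time grid on (0, 1]

incr = rng.normal(0.0, np.sqrt(dt), size=(n_paths, grid.size))
X = np.cumsum(incr, axis=1)                   # standard Brownian motion paths

s, t = 0.3, 0.7
i_s, i_t = int(round(s / dt)) - 1, int(round(t / dt)) - 1

# Brownian bridge (Proposition 10.1): Z(t) = X(t) - t X(1)
Z = X - grid * X[:, [-1]]
print("bridge cov:", np.cov(Z[:, i_s], Z[:, i_t])[0, 1])   # ~ s(1-t) = 0.09

# integrated Brownian motion: Z(t) = int_0^t X(u) du
I = np.cumsum(X, axis=1) * dt
print("int BM cov:", np.cov(I[:, i_s], I[:, i_t])[0, 1])   # ~ s^2(t/2 - s/6) = 0.027
```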

Lecture 3

Stationary and Weakly Stationary Processes

  • A stochastic process \(\{X(t),t\geq 0\}\) is said to be a stationary process if \(\forall n,s,t_1,\ldots,t_n\), \(X(t_1),\ldots,X(t_n)\) and \(X(t_1+s),\ldots,X(t_n+s)\) have the same joint distribution.
    • a process is stationary if, in choosing any fixed point s as the origin, the ensuing process has the same probability law.
    • Example: an ergodic continuous-time Markov chain \(\{X(t),t\geq 0\}\) with \(\Pr[X(0)=j]=P_j,j\geq 0\), where the \(P_j\) are the limiting probabilities, is a stationary process.
      • a Markov chain whose initial state is chosen according to the limiting probabilities.
      • can be regarded as an ergodic Markov chain that we start observing at time \(\infty\). Hence, the continuation of this process at time s after observation begins is just the continuation of the chain starting at time \(\infty+s\), which clearly has the same probability law for all s.
    • Example: \(X(t)=N(t+L)−N(t), t\geq 0\), where L>0 is a fixed constant and \(\{N(t),t\geq 0\}\) is a Poisson process having rate \(\lambda\), is also a stationary process (see the sketch below).
      • X(t) represents the number of events of a Poisson process that occur between t and t + L
      • Poisson process has stationary and independent increments
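
For the Poisson-increment example, stationarity is easy to check empirically: the covariance of X(t) and X(t+s) should not depend on t (a sketch; \(\lambda\), L, s, and the grid approximation of the Poisson process are all illustrative assumptions).

```python
import numpy as np

rng = np.random.default_rng(4)

lam, L, s = 2.0, 1.0, 0.4
n_paths, horizon, dt = 5_000, 5.0, 1e-2

# approximate a rate-lam Poisson process on a fine grid by Bernoulli(lam*dt) events
events = rng.random((n_paths, int(horizon / dt))) < lam * dt
N = np.cumsum(events, axis=1)                 # N(t) at grid times dt, 2*dt, ...

def X(t):
    """X(t) = N(t+L) - N(t): number of events in (t, t+L]."""
    i, j = int(round(t / dt)) - 1, int(round((t + L) / dt)) - 1
    return N[:, j] - N[:, i]

for t0 in (0.5, 1.5, 3.0):    # covariance should be roughly the same for each t0
    c = np.cov(X(t0), X(t0 + s))[0, 1]
    print(f"t = {t0}: Cov[X(t), X(t+s)] ~ {c:.3f}")
```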
  • \(\{X(t),t\geq 0\}\) is second-order stationary or weakly stationary
    • if \(E[X(t)]=c\) and \(Cov[X(t),X(t+s)]\) does not depend on t
      • the first two moments of X(t) are the same for all t and the covariance between X(s) and X(t) depends only on |t−s|
      • As the finite dimensional distributions of a Gaussian process (being multivariate normal) are determined by their means and covariance, it follows that a second-order stationary Gaussian process is stationary.
    • Proposition 10.2 Let \(\{X_n,n\geq 1\}\) be a second-order stationary process with mean \(\mu\) and covariance function \(R(i)=Cov(X_n,X_{n+i})\), and let \(\bar{X}_n=\sum_{i=1}^nX_i/n\)
      • \(\lim_{n\to\infty} E[(\bar{X}_n-\mu)^2]=0\) iff \(\lim_{n\to\infty}\sum_{i=1}^n R(i)/n=0\).
  • autoregressive process (Ex 10.7, Page 657)
    • Let \(\{Z_n\}\) be uncorrelated rvs with \(E[Z_n]=0, n\geq 0\)
      • \(Var(Z_0)=\sigma^2/(1-\lambda^2), Var(Z_n)=\sigma^2, n\geq 1, \lambda^2<1\)
    • Define \(X_0=Z_0, X_n=\lambda X_{n-1}+Z_n, n\geq 1\)
      • \(\{X_n\}\) is a first-order autoregressive process
      • \(X_n=\sum_{i=0}^n\lambda^{n-i}Z_i\)
      • \(E(X_n)=0, Cov(X_n,X_{n+m})=\sigma^2\lambda^m/(1-\lambda^2)\)
      • \(\{X_n,n\geq 0\}\) is weakly stationary
  • moving average process (Ex 10.9, Page 658)
    • Let \(\{W_n\}\) be uncorrelated with \(E[W_n]=\mu,Var[W_n]=\sigma^2,n\geq 0\)
    • Define \(X_n=\sum_{i=0}^kW_{n-i}/(k+1), k>0\)
      • \(\{X_n,n\geq k\}\) is a moving average process
        • arithmetic average of the most recent k+1 values of the Ws
        • \(Cov(X_n,X_{n+m})=(k+1-m)\sigma^2/(k+1)^2\) for \(0\leq m\leq k\), and \(=0\) for \(m>k\).
      • \(X_n\) is a second-order stationary process.
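
The autoregressive and moving-average covariance formulas above can both be checked by simulating many independent series (a sketch; \(\lambda\), \(\sigma\), k, and the chosen lags are illustrative).

```python
import numpy as np

rng = np.random.default_rng(5)
n_series, n, sigma, lam = 20_000, 200, 1.0, 0.6

# first-order autoregressive process: X_0 = Z_0, X_n = lam * X_{n-1} + Z_n
X = np.empty((n_series, n))
X[:, 0] = rng.normal(0.0, sigma / np.sqrt(1 - lam**2), n_series)   # Var(Z_0) = sigma^2/(1-lam^2)
for i in range(1, n):
    X[:, i] = lam * X[:, i - 1] + rng.normal(0.0, sigma, n_series)

m = 3
print("AR(1) cov:", np.cov(X[:, 100], X[:, 100 + m])[0, 1])
print("theory   :", sigma**2 * lam**m / (1 - lam**2))               # 0.3375

# moving average of the most recent k+1 W's
k, m = 4, 2
W = rng.normal(0.0, sigma, size=(n_series, n))
MA = np.stack([W[:, i - k:i + 1].mean(axis=1) for i in range(k, n)], axis=1)
print("MA cov   :", np.cov(MA[:, 50], MA[:, 50 + m])[0, 1])
print("theory   :", (k + 1 - m) * sigma**2 / (k + 1)**2)            # 0.12
```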

Lecture 4

Application of Gaussian process models to regression problems

  • First let's consider a linear regression problem (in a mixed-model or Bayesian setting); a numerical sketch of the weight-space and kernel forms appears at the end of this block
    • \(f(X)=X\beta,Y=f(X)+\epsilon, \epsilon\sim N(0,\sigma^2I_n)\)
      • data: X (nxp), Y (n observations); parameters: \(\beta\) (p regression parameters)
      • assume a zero-mean Gaussian prior on \(\beta\) with covariance matrix \(\Sigma_p\): \(\beta\sim N(0,\Sigma_p)\)
    • inference could be based on the posterior dist of \(\beta\)
      • \(\Pr(\beta|Y,X)=\frac{\Pr(\beta)\Pr(Y|\beta,X)}{\Pr(Y|X)}\)
        • normalizing constant/marginal likelihood: \(\Pr(Y|X)=\int\Pr(\beta)\Pr(Y|\beta,X)d\beta\)
        • \((Y|X)\sim N(0,X\Sigma_pX^T+\sigma^2I_n)\) (using \(E[Y],Var[Y]\))
      • easy to check \((\beta|Y,X)\sim N(\sigma^{-2}A^{-1}X^TY,A^{-1}), A=\sigma^{-2}X^TX+\Sigma_p^{-1}\)
        • note \(A^{-1}=(\sigma^{-2}X^TX+\Sigma_p^{-1})^{-1}=\Sigma_p-\Sigma_pX^T(X\Sigma_pX^T+\sigma^2I_n)^{-1}X\Sigma_p\)
    • for a new observation with \(X_1\), want to predict the response
      • \(\Pr[f(X_1)|X,Y]=\int \Pr(f(X_1)|X_1,\beta)\Pr(\beta|Y,X)d\beta\)
        • \(f(X_1)=X_1^T\beta\)
        • average over all possible parameter values, weighted by their posterior probability
        • \(\sim N(\sigma^{-2}X_1^TA^{-1}X^TY,X_1^TA^{-1}X_1)\)
    • extend linear model by projecting covariates \(X\) into high dimensional space
      • project \(X\) into some high-dimensional space using a set of basis functions and then apply the linear model in this space instead of directly on \(X\) itself.
      • assume \(\Phi(X)\): map \(p\)-dim \(X\) into M-dim covariates
      • Consider model \(f(X)=\Phi(X)\beta\), \(Y=f(X)+\epsilon\), \(\epsilon\sim N(0,\sigma^2I_n)\)
        • data: \(\Phi(X)\) matrix of (nxM)-dim; parameter: \(\beta\in R^M\)
        • prior \(\beta\sim N(0,\Sigma_M)\)
        • posterior \((\beta|Y,X)\sim N(\sigma^{-2}A^{-1}\Phi(X)^TY,A^{-1}), A=\sigma^{-2}\Phi(X)^T\Phi(X)+\Sigma_M^{-1}\)
        • given \(X_1\in R^p\), \(\Phi(X_1)\in R^M\). Consider \(f(X_1)=\Phi(X_1)^T\beta\). \([f(X_1)|X,Y]\) still follows a normal dist with
          • mean \(\Phi(X_1)^T\Sigma_M\Phi(X)^T(K+\sigma^2I_n)^{-1}Y\), \(K=\Phi(X)\Sigma_M\Phi(X)^T\)
          • variance \(\Phi(X_1)^T\Sigma_M\Phi(X_1)-\Phi(X_1)^T\Sigma_M\Phi(X)^T(K+\sigma^2I_n)^{-1}\Phi(X)\Sigma_M\Phi(X_1)\)
          • \(\Leftarrow \sigma^{-2}A^{-1}\Phi(X)^T=\Sigma_M\Phi(X)^T(K+\sigma^2I_n)^{-1}\) (multiply both sides by \(A\) to prove mean equivalence)
          • covariance matrix equivalence derived from the matrix inversion lemma (Woodbury, Sherman & Morrison formula)
            • \((Z+UWV^T)^{-1}=Z^{-1}-Z^{-1}U(W^{-1}+V^TZ^{-1}U)^{-1}V^TZ^{-1}\)
              • Z (nxn); W (mxm); U,V (nxm) (works well for \(m<n\))
              • similar results for determinant: \(|Z+UWV^T|=|Z||W||W^{-1}+V^TZ^{-1}U|\)
              • when \(U=V=I\), \((A^{-1}+B^{-1})^{-1}=A-A(A+B)^{-1}A=B-B(A+B)^{-1}B\)
            • taking \(Z=\Sigma_M^{-1},W=\sigma^{-2}I_n,V=U=\Phi(X)^T\), \(\Rightarrow\) variance equivalence
        • all posterior quantities are determined by \(\Phi(X_1)^T\Sigma_M\Phi(X)^T\), \(\Phi(X)\Sigma_M\Phi(X)^T\) or \(\Phi(X_1)^T\Sigma_M\Phi(X_1)\)
          • they are all quadratic functions and can be treated as some type of inner product measure (wrt \(\Sigma_M\))
          • define \(k(X,X_1)=\Phi(X)^T\Sigma_M\Phi(X_1)\) (covariance function or kernel)
            • Let \(\Sigma_M^{1/2}\) be the square root of \(\Sigma_M\), then \(k(X,X_1)=[\Sigma_M^{1/2}\Phi(X)]^T[\Sigma_M^{1/2}\Phi(X_1)]\)
          • when an algorithm is defined solely in terms of inner products in \(X\), it can be extended by replacing those inner products by \(k(X,X_1)\);
            • this is sometimes called the kernel trick.
            • it is particularly valuable in situations where it is more convenient to compute the kernel than the extended vectors themselves.
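
The weight-space posterior predictive (via \(A^{-1}\)) and the kernel form (via \(K+\sigma^2I_n\)) should give identical answers; the sketch below checks this agreement numerically on random data (the feature matrix, prior covariance, and noise level are illustrative assumptions).

```python
import numpy as np

rng = np.random.default_rng(6)
n, M, sigma2 = 50, 5, 0.25

Phi = rng.normal(size=(n, M))                       # Phi(X): n x M feature matrix
Sigma_M = np.diag(rng.uniform(0.5, 2.0, M))         # prior covariance of beta
beta = rng.multivariate_normal(np.zeros(M), Sigma_M)
Y = Phi @ beta + rng.normal(0.0, np.sqrt(sigma2), n)
phi1 = rng.normal(size=M)                           # Phi(X_1) for a new observation

# weight-space form: A = sigma^-2 Phi^T Phi + Sigma_M^-1
A = Phi.T @ Phi / sigma2 + np.linalg.inv(Sigma_M)
mean_w = phi1 @ np.linalg.solve(A, Phi.T @ Y) / sigma2
var_w = phi1 @ np.linalg.solve(A, phi1)

# function-space (kernel) form: K = Phi Sigma_M Phi^T, k1 = k(X, X_1)
K = Phi @ Sigma_M @ Phi.T
B = K + sigma2 * np.eye(n)
k1 = Phi @ Sigma_M @ phi1
mean_f = k1 @ np.linalg.solve(B, Y)
var_f = phi1 @ Sigma_M @ phi1 - k1 @ np.linalg.solve(B, k1)

print(np.allclose(mean_w, mean_f), np.allclose(var_w, var_f))      # True True
```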
  • We can treat a Gaussian process as a collection of random variables, any finite number of which have a joint Gaussian distribution
    • Consider a Gaussian Process (GP) \(f(x)\) with mean function \(m(x)\) and covariance function \(k(x,x_1)\). Here \(x\in R^p\) (x can be treated as time index).
      • \(E[f(x)]=m(x), Cov[f(x),f(x_1)]=E[(f(x)-m(x))(f(x_1)-m(x_1))]=k(x,x_1)\)
      • \(f(x)\sim GP(m(x),k(x,x_1))\)
      • random variables represent the value of the function \(f(x)\) at location \(x\)
    • Consider a linear reg model with normal prior: \(f(x)=\Phi(x)^T\beta, \beta\sim N(0,\Sigma_M)\)
      • \(E[f(x)]=\Phi(x)^TE(\beta)=0, Cov[f(x),f(x_1)]=\Phi(x)^TE[\beta\beta^T]\Phi(x_1)=\Phi(x)^T\Sigma_M\Phi(x_1)\)
        • \([f(x),f(x_1)]\) follow bivariate normal dist \(\forall x, x_1\).
      • \(\forall x_1,\ldots,x_n\), \([f(x_1),\ldots,f(x_n)]\) jointly follow multivariate normal dist (MVN).
        • could be a degenerate/singular dist when \(n>M\) (M is the number of basis, i.e., dim of \(\Phi(x)\))
        • \(f(x)\) can be treated as a Gaussian process with mean \(m(x)=0\) and covariance \(k(x,x_1)=\Phi(x)^T\Sigma_M\Phi(x_1)\)
    • Could be extended by using a more general covariance function k(,)
      • for every positive definite covariance function k(,), there exists a (possibly infinite) expansion of basis functions
        • e.g., \(k(x,x_1)=\exp(-\|x-x_1\|^2/2)\) (squared exponential kernel)
      • for any set of basis functions, we can compute the corresponding covariance function as \(k(x,x_1)=\Phi(x)^T\Sigma\Phi(x_1)\)
      • k(,) implies a distribution over functions f(x).
        • at any number of points f(x) is a MVN dist with the corresponding covariance matrix k(,)
    • Consider GP \(f(x)\) with mean 0 and covariance function \(k(,)\)
      • observed data \(X\) (nxp) and some new data \(X_1\) (mxp), jointly \([f(X),f(X_1)]\) has MVN dist
        • mean 0 and covariance \(Var[f(X)]=k(X,X),Var[f(X_1)]=k(X_1,X_1), Cov[f(X),f(X_1)]=k(X,X_1)\)
        • \(f(X)\in R^n, f(X_1)\in R^m\), \(k(X,X)\) (nxn), \(k(X,X_1)\) (nxm)
      • conditionally \([f(X_1)|f(X),X,X_1]\) has MVN dist
        • mean \(k(X_1,X)k(X,X)^{-1}f(X)\), covariance \(k(X_1,X_1)-k(X_1,X)k(X,X)^{-1}k(X,X_1)\)
      • Assume \(y=f(x)+\epsilon, \epsilon\sim N(0,\sigma^2)\)
        • assume independent \(\epsilon\) for different \(x\)
        • \(y\) is also a GP with mean 0 and covariance function \(K(x,x)=k(x,x)+\sigma^2\), \(K(x,x_1)=k(x,x_1), \forall x\neq x_1\)
        • Given observed \(Y\) (and \(X\)), for new \(X_{\ast}\)
          • \([Y,f(X_{\ast})]\) has MVN dist with mean 0 and covariance: \(Var[Y]=k(X,X)+\sigma^2I_n, Cov[Y,f(X_{\ast})]=k(X,X_{\ast}), Var[f(X_{\ast})]=k(X_{\ast},X_{\ast})\)
          • \(Y\sim N(0,k(X,X)+\sigma^2I_n)\)
          • conditionally \([f(X_{\ast})|Y]\) is a MVN with
            • mean \(k(X_{\ast},X)[k(X,X)+\sigma^2I_n]^{-1}Y\) (linear function of Y)
            • variance \(k(X_{\ast},X_{\ast})-k(X_{\ast},X)[k(X,X)+\sigma^2I_n]^{-1}k(X,X_{\ast})\)
          • for single obs \(X_{\ast}\), conditional mean is \(\sum_{i=1}^n\alpha_ik(X_{\ast},X_i), \alpha=[k(X,X)+\sigma^2I_n]^{-1}Y\)
            • a linear combination of n kernel functions
    • GP with fixed mean function
      • \(f(x)\sim GP(m(x),k(x,x))\)
        • \(m(x)\) given fixed function
      • conditional mean: \(m(X_{\ast})+k(X_{\ast},X)[k(X,X)+\sigma^2I_n]^{-1}[Y-m(X)]\)
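
Putting the predictive equations together, below is a minimal GP regression sketch with the squared exponential kernel; the training data (a noisy sine), noise level, and test grid are illustrative assumptions, not part of the lecture.

```python
import numpy as np

rng = np.random.default_rng(7)

def se_kernel(A, B, length=1.0):
    """Squared exponential kernel k(x, x') = exp(-||x - x'||^2 / (2 length^2))."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / length**2)

# training data: noisy observations of a smooth function (illustrative)
X = np.linspace(0, 5, 20)[:, None]
Y = np.sin(X[:, 0]) + rng.normal(0.0, 0.1, X.shape[0])
sigma2 = 0.1**2

Xs = np.linspace(0, 5, 100)[:, None]          # test points X_*

K = se_kernel(X, X)
Ks = se_kernel(Xs, X)                          # k(X_*, X)
Kss = se_kernel(Xs, Xs)                        # k(X_*, X_*)

# predictive distribution of f(X_*) given Y, using the mean/variance formulas above
alpha = np.linalg.solve(K + sigma2 * np.eye(len(X)), Y)
mean = Ks @ alpha                              # sum_i alpha_i k(X_*, X_i)
cov = Kss - Ks @ np.linalg.solve(K + sigma2 * np.eye(len(X)), Ks.T)
sd = np.sqrt(np.clip(np.diag(cov), 0.0, None))

print("max |pred mean - sin(x)|:", np.abs(mean - np.sin(Xs[:, 0])).max())
print("predictive sd, min/max  :", sd.min(), sd.max())
```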

Lecture 5

Application of Gaussian process models to classification with logistic regression

  • Consider a linear logistic regression model with Gaussian prior
    • \(n\) observations: \(\mbox{logit}[\Pr(y_i=1|x_i,\beta)]=x_i^T\beta\), \(i=1,\ldots,n\), \(\beta\sim N(0,\Sigma)\), \(x_i\) (mx1), \(Y=(y_1,\ldots,y_n)^T\)
      • assume a conditionally independent logit model
      • marginal prob \(\Pr(y_i|x_i)=\int\Pr(y_i|x_i,\beta)\phi(\beta,\Sigma)d\beta\), where \(\phi(\beta,\Sigma)=(2\pi)^{-m/2}|\Sigma|^{-1/2}\exp(-\beta^T\Sigma^{-1}\beta/2)\)
      • joint marginal prob \(\Pr(Y|X)=\int[\prod_{i=1}^n\Pr(y_i|x_i,\beta)]\phi(\beta,\Sigma)d\beta\)
      • conditional/posterior prob \(\Pr(\beta|Y,X)=[\prod_{i=1}^n\Pr(y_i|x_i,\beta)]\phi(\beta,\Sigma)/\Pr(Y|X)\)
    • For \(K\) new observations with covariates \(X_1\) (Kxm)
      • classification incorporating training obs: \(\Pr(Y_1|X_1,Y,X)=\int\Pr(Y_1|X_1,\beta)\Pr(\beta|Y,X)d\beta=\frac{\int\Pr(Y_1|X_1,\beta)\Pr(Y|X,\beta)\phi(\beta,\Sigma)d\beta}{\int\Pr(Y|X,\beta)\phi(\beta,\Sigma)d\beta}\)
        • involving two multi-dim integrations with gaussian kernels
  • Extension to logistic regression model with Gaussian Process prior
    • \(\mbox{logit}[\Pr(y_i=1|x_i,f_i)]=f_i\), \(i=1,\ldots,n\), \(F=(f_1,\ldots,f_n)^T\sim N(0,\Sigma_F)\), \(Y=(y_1,\ldots,y_n)^T\), \(X\) (nxm)
      • conditional independence: the dist of \(y_i\) is completely determined by \(f_i\); given \(F\), the \(y_i\) are independent
        • can incorporate additional independent errors to \(y_i|f_i\) (conditional independence still holds)
      • GP prior: \(Cov(f_i,f_j)=K(x_i,x_j), E(f_i)=0\)
      • joint marginal prob \(\Pr(Y|X)=\int[\prod_{i=1}^n\Pr(y_i|x_i,f_i)]\phi(F,\Sigma_F)dF\) (integration over n-dim GP F)
    • For \(K\) new observations with covariates \(X_1\)
      • \(\mbox{logit}[\Pr(Y_1=1|X_1,F_1)]=F_1\), \(F_1\sim N(0,\Sigma_1), Cov(f_{1i},f_{1j})=K(x_{1i},x_{1j})\)
      • GP prior: \((F,F_1)\sim N(0,\tilde{\Sigma}), Cov(f_i,f_{1j})=K(x_i,x_{1j})\) ((n+K)-dim normal dist)
        • \(F_1|F\sim N(A\Sigma_F^{-1}F,\Sigma_1-A\Sigma_F^{-1}A^T), A=Cov(F_1,F)=K(X_1,X)\)
      • \(\Pr(Y_1|X_1,Y,X)=\frac{\Pr(Y_1,Y|X,X_1)}{\Pr(Y|X)}=\frac{\int\Pr(Y_1,Y|X,X_1,F,F_1)\phi[(F,F_1),\tilde{\Sigma}]d(F,F_1)}{\int\Pr(Y|X,F)\phi(F,\Sigma_F)dF}\)
        • two multi-dim integrations over GP priors
  • Consider logistic regression with GP prior
    • multi-variate normal prior \(F\sim N(0,\Sigma_F)\)
      • covariance matrix is computed based on covariance/kernel function, \(\Sigma_F=K(X,X)\)
    • conditionally independent logistic regression model
      • \(\theta_i=\Pr(y_i=1|f_i)=\frac{1}{1+\exp(-f_i)}, i=1,\ldots,n\)
    • In summary, the logistic log likelihood can be written as \(\ell=\log[\Pr(y|f)]=-\log[1+\exp(-(2y-1)f)]=yf-\log[1+\exp(f)]\), \(y\in\{0,1\}\)
      • \(\frac{\partial\ell}{\partial f}=y-\theta, \frac{\partial^2\ell}{\partial f^2}=-\theta(1-\theta)\), where \(\theta=\frac{1}{1+\exp(-f)}\)
      • quadratic expansion at \(f_0\): \(\ell\approx \ell(f_0)+(y-\theta_0)(f-f_0)-\frac{1}{2}\theta_0(1-\theta_0)(f-f_0)^2\)
    • Laplace approx to high-dim integration
      • solve \(F_0\): \(\max_F \sum_{i=1}^n\log\Pr(y_i|f_i) - \frac{1}{2}F^T\Sigma_F^{-1}F\) (i.e., maximizing the joint or conditional likelihood)
        • logistic regression with ridge penalty, solved using, e.g., NR (or IWLS): \(F\Leftarrow F+(\Sigma_F^{-1}+W)^{-1}(Y-\theta-\Sigma_F^{-1}F)\), \(W=\mbox{diag}\{\theta(1-\theta)\}\)
          • NR can be equivalently written as \(F\Leftarrow (\Sigma_F^{-1}+W)^{-1}(Y-\theta+WF)\)
        • note \((\Sigma_F^{-1}+W)^{-1}=\Sigma_F-\Sigma_FW^{1/2}(I+W^{1/2}\Sigma_FW^{1/2})^{-1}W^{1/2}\Sigma_F=\Sigma_F-\Sigma_F(W^{-1}+\Sigma_F)^{-1}\Sigma_F\)
      • \(\Pr(Y|X)=\int\Pr(Y|X,F)\Pr(F)dF\approx\exp[\ell(F_0)-F_0^T\Sigma_F^{-1}F_0/2]|I+\Sigma_FW_0|^{-1/2}\)
    • In principle, additional fixed-effects covariates can be incorporated; estimation then alternates between updating those parameters and the approximate Laplace integration.
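
A sketch of the Newton/IWLS iteration and Laplace marginal likelihood described above, for GP logistic regression (the synthetic data, squared exponential kernel, jitter, and iteration count are all illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(8)

def se_kernel(A, B, length=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / length**2)

# synthetic binary data whose log-odds vary smoothly with x (illustrative)
n = 80
X = np.sort(rng.uniform(-3, 3, n))[:, None]
true_f = 2.0 * np.sin(X[:, 0])
Y = (rng.random(n) < 1 / (1 + np.exp(-true_f))).astype(float)

Sigma_F = se_kernel(X, X) + 1e-6 * np.eye(n)     # GP prior covariance K(X, X), with jitter
Sigma_inv = np.linalg.inv(Sigma_F)

# Newton/IWLS iterations for F_0 = argmax sum_i log Pr(y_i|f_i) - F^T Sigma_F^{-1} F / 2,
# using the update F <- (Sigma_F^{-1} + W)^{-1} (Y - theta + W F)
F = np.zeros(n)
for _ in range(25):
    theta = 1 / (1 + np.exp(-F))
    W = theta * (1 - theta)
    F = np.linalg.solve(Sigma_inv + np.diag(W), Y - theta + W * F)

# Laplace approximation to the marginal likelihood Pr(Y|X)
theta = 1 / (1 + np.exp(-F))
W = theta * (1 - theta)
ell = np.sum(Y * F - np.log1p(np.exp(F)))        # logistic log likelihood at F_0
log_pY = ell - 0.5 * F @ Sigma_inv @ F - 0.5 * np.linalg.slogdet(np.eye(n) + Sigma_F * W)[1]
print("Laplace log marginal likelihood:", log_pY)
```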