GLMM estimation with MCMC, and a numerical example
Refs:
- Charles McCulloch (1997). Maximum Likelihood Algorithms for Generalized Linear Mixed Models. JASA, 92(437), 162-170. http://www.jstor.org/stable/2291460
- Charles McCulloch (1994). Maximum Likelihood Variance Components Estimation for Binary Data. JASA, 89(425), 330-335. http://www.jstor.org/stable/2291229
Probit-normal model for binary data
- data: \(Y=(Y_1,\ldots,Y_K)\): K-dim 0-1 vector
- \(\Phi^{-1}[\Pr(Y_i=1|u)] = X_i\beta+Z_iu\)
- \(\Phi\) is CDF of \(N(0,1)\) dist.
- \(u\): q-dim random effects following \(N(0,R)\)
- latent variable model representation
- \(W=X\beta+Zu+\epsilon\), \(\epsilon\sim N(0,I)\)
- \(Y_i=I(W_i>0)\) since \(\Pr(Y_i=1|u)=\Pr(X_i\beta+Z_iu>-\epsilon_i)=\Phi(X_i\beta+Z_iu)\)
- Marginally \(W\sim N(X\beta,ZRZ^T+I)\)
- so \(\Pr(Y_i=1) = \Phi(X_i\beta/\sqrt{V_{ii}})\), where \(V=ZRZ^T+I\).
- The random effects \(u\) do change the marginal mean of the outcomes: the fixed effects are attenuated by the factor \(1/\sqrt{V_{ii}}\), as illustrated in the sketch below.
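- A small R sketch (illustrative, not from the papers) simulating this probit-normal model with a scalar random intercept and comparing the empirical marginal mean of \(Y\) with \(\Phi(X_i\beta/\sqrt{V_{ii}})\):
## Illustrative sketch (not from the papers): simulate the probit-normal model
## with a scalar random intercept (q = 1, R = r2) and check the marginal mean.
set.seed(1)
K = 5; r2 = 2                          # cluster size; random-effect variance R
X = cbind(1, rnorm(K)); beta = c(-0.5, 1)
Z = matrix(1, K, 1)                    # random-intercept design
nsim = 1e5
Y = matrix(0, nsim, K)
for(s in 1:nsim){
    u = rnorm(1, sd=sqrt(r2))          # u ~ N(0, R)
    W = X %*% beta + Z %*% u + rnorm(K)   # latent W = X beta + Z u + eps
    Y[s,] = 1*(W > 0)                  # Y_i = I(W_i > 0)
}
V = Z %*% r2 %*% t(Z) + diag(K)        # marginal Var(W) = Z R Z^T + I
## empirical Pr(Y_i=1) vs. the attenuated probit mean Phi(X_i beta / sqrt(V_ii))
cbind(colMeans(Y), pnorm((X %*% beta) / sqrt(diag(V))))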
- EM Estimation for fixed effects parameters \(\beta\)
- complete data \(W\)
- \(\max_{\beta} E[\log\Pr(W)|Y]\)
- we can check that \(\hat{\beta} = (X^TV^{-1}X)^{-1}X^TV^{-1}E(W|Y)\)
- EM estimation for \(R\)
- complete data \((W,u)\)
- \(\max_{R}E[\log\Pr(u)|Y]\)
- we can check \(\hat{R} = E(uu^T|Y)\)
- note \(E(uu^T|Y)=E[E(uu^T|W)|Y]\)
- and \((u|W)\) follows a conditional normal dist (the mean is a linear function of \(W\) and the conditional variance is fixed)
- \(E(u|W) = RZ^TV^{-1}(W-X\beta)\)
- \(Var(u|W) = R - RZ^TV^{-1}ZR\)
- Both estimations reduce to computing \(E(W|Y)\) and \(Var(W|Y)\) (a sketch combining the Gibbs sampler with both EM updates follows this list)
- estimate mean and variance of constrained MVN dist
- easy for one-dim normal dist
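- e.g. for one-dimensional \(W\sim N(\mu,\sigma^2)\) constrained to \(W>0\) (i.e. \(Y=1\)), the closed form is \(E(W|W>0)=\mu+\sigma\,\phi(\mu/\sigma)/\Phi(\mu/\sigma)\), with \(\phi\) the \(N(0,1)\) density; \(Var(W|W>0)\) has a similar closed form.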
- Monte Carlo estimation with Gibbs sampling
- \((W_i|W_{-i},Y)\) equiv to \((W_i|W_{-i},Y_i)\)
- note \((W_i|W_{-i})\) follows a normal dist
- \(E(W_i|W_{-i})=X_i\beta + V_{-i,i}^TV_{-i,-i}^{-1}(W_{-i}-X_{-i}\beta)\)
- \(Var(W_i|W_{-i})=V_{ii} - V_{-i,i}^TV_{-i,-i}^{-1}V_{-i,i}\)
- We need to simulate from \((W_i|W_{-i})\) constrained to \(W_i>0\) if \(Y_i=1\) (or \(W_i\le 0\) if \(Y_i=0\))
- this is simulating from a one-dim truncated normal dist.
- sequentially simulate \(W_i|W_{-i},Y\) for \(i=1,\ldots,K\).
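- A minimal R sketch (illustrative only, not the papers' code) of one Gibbs sweep over \(W|Y\), followed by the two EM updates computed from the resulting draws; it assumes \(X\), \(Z\), \(R\), \(\beta\), the observed 0-1 vector \(Y\), and \(V=ZRZ^T+I\) are already in scope:
## Sketch (not the papers' code): one Gibbs sweep over W | Y, then the two EM
## updates computed from the draws. Assumes X, Z, R, beta, the 0/1 vector Y,
## and V = Z R Z^T + I are in scope.
gibbs.sweep <- function(W, Y, X, beta, V){
    K = length(W); mu = X %*% beta
    for(i in 1:K){
        a  = solve(V[-i,-i], V[-i,i])              # V_{-i,-i}^{-1} V_{-i,i}
        mi = mu[i] + sum(a * (W[-i] - mu[-i]))     # E(W_i | W_{-i})
        si = sqrt(V[i,i] - sum(V[i,-i] * a))       # sd(W_i | W_{-i})
        ## truncate to W_i > 0 if Y_i = 1, or W_i <= 0 if Y_i = 0 (inverse-CDF draw)
        p0 = pnorm(0, mi, si)
        W[i] = qnorm(if(Y[i]==1) runif(1, p0, 1) else runif(1, 0, p0), mi, si)
    }
    W
}
## collect B draws of W | Y after a burn-in
B = 2000; W = rep(0, length(Y)); Wb = matrix(0, B, length(Y))
for(b in 1:(B+500)){
    W = gibbs.sweep(W, Y, X, beta, V)
    if(b > 500) Wb[b-500,] = W
}
EW = colMeans(Wb)                                  # Monte Carlo E(W | Y)
## beta update: GLS regression of E(W|Y) on X with weight V^{-1}
beta.new = solve(t(X) %*% solve(V, X), t(X) %*% solve(V, EW))
## R update: E(uu^T|Y) = Var(u|W) + average over (W|Y) draws of E(u|W) E(u|W)^T
A = R %*% t(Z) %*% solve(V)                        # E(u|W) = A (W - X beta)
R.new = R - A %*% Z %*% R                          # Var(u|W), free of W
for(b in 1:B){
    m = A %*% (Wb[b,] - X %*% beta)
    R.new = R.new + (m %*% t(m)) / B
}
- iterating the Gibbs sampling and the two updates to convergence gives the MCEM estimates.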
General GLMM
- \(\Pr(Y|u)\) independent GLMs
- \(u\sim N(0,R)\)
- EM algorithm estimation
- E-step: \(Q = E[\log\Pr(Y|u)|Y] + E[\log\Pr(u)|Y]\)
- Monte Carlo approx using \(u\sim \Pr(u|Y)\)
- Markov Chain simulation from \(\Pr(u|Y)\propto \Pr(Y|u)\Pr(u)\)
- Metropolis sampling
- given current value \(u\)
- new value \(u'\) from proposal density \(h(u)\)
- acceptance prob: \(\min(1, \Pr(u'|Y)h(u)/[\Pr(u|Y)h(u')])\)
- \(=\min\!\left(1, \frac{\Pr(u')h(u)}{\Pr(u)h(u')}\cdot\frac{\Pr(Y|u')}{\Pr(Y|u)}\right)\)
- taking \(h(u)=\Pr(u)\), the acceptance prob becomes \(\min(1,\Pr(Y|u')/\Pr(Y|u))\)
- to keep the acceptance probability from being too small, one can take \(u'_k=u_k, k\neq i\) and update \(u'_i\) only, proposing from \(\Pr(u_i|u_{-i})\) (a conditional normal dist); see the sketch after this list
- \(Q\approx B^{-1}\sum_b [\log\Pr(Y|u_b) + \log\Pr(u_b)]\), where \(u_1,\ldots,u_B\sim \Pr(u|Y)\)
- maximization straightforward: a GLM fit (for the fixed effects) and a covariance matrix estimate (for \(R\))
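- A small R sketch of this component-wise Metropolis step (illustrative; loglikY is an assumed user-supplied function returning \(\log\Pr(Y|u)\) for the chosen GLM, and R here is \(Var(u)\)):
## Sketch of one component-wise Metropolis scan with the prior as proposal.
## loglikY(u) is assumed to return log Pr(Y|u) for the GLMM at hand; R = Var(u).
metro.step <- function(u, R, loglikY){
    q = length(u)
    for(i in 1:q){
        ## propose u_i' from Pr(u_i | u_{-i}), a conditional normal under u ~ N(0, R)
        if(q == 1){ mi = 0; si = sqrt(R[1,1]) } else {
            a  = solve(R[-i,-i, drop=FALSE], R[-i,i])
            mi = sum(a * u[-i])
            si = sqrt(R[i,i] - sum(R[i,-i] * a))
        }
        up = u; up[i] = rnorm(1, mi, si)
        ## with the prior as proposal, the acceptance prob reduces to Pr(Y|u')/Pr(Y|u)
        if(log(runif(1)) < loglikY(up) - loglikY(u)) u = up
    }
    u
}
- scanning over \(i=1,\ldots,q\) repeatedly and recording the visited values of \(u\) yields the draws \(u_b\) used in the Monte Carlo E-step.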
A numerical example
- First we simulate a longitudinal dataset with 4 observations for each of 1000 individuals, and fit it with glmer() from the lme4 R package.
- three continuous covariates (varying over time)
- one ordinal covariate (constant over time)
- consider a random intercept model (mean zero and variance 100)
n = 1e3; p = 3; K = 4; sig = 10
set.seed(123)
## time varying covariates
Xl = vector('list', K)
for(k in 1:K) Xl[[k]] = matrix(rnorm(n*p), n,p)
## constant covariate
Z = rbinom(n, 2,0.2)
## random effects
U = rnorm(n)*sig
## fixed effects
etaX = sapply(Xl, rowSums)
## random errors
eps = matrix(rnorm(n*K), n,K)
## logit model
eta = etaX + U + eps
prb = 1/(1+exp(-eta))
D = 1*(matrix(runif(n*K),n,K)<prb)
Xs = Xl[[1]]
for(k in 2:K) Xs = rbind(Xs, Xl[[k]])
## GLMM model
library(lme4)
sid = rep(1:n, K)
## model fit with glmer() (defaults to the Laplace approximation)
a1 = glmer(c(D) ~ Xs + Z[sid] + (1|sid), family=binomial)
- We will develop three numerical routines
- Simulate from conditional dist of random effects given data
- Will use Metropolis-Hastings (MH) Markov chain sampling
- Compute proposal dist in MH sampling
- Will use a normal dist matching the mode and Hessian (curvature) of the conditional dist of the random effects
- Feasible since the unknown normalizing constant affects neither the mode nor the Hessian.
- Note each individual has its own proposal dist.
- Need to solve a ridge-penalized (one-parameter) logistic regression for each individual
- Parameter estimation
- Fixed effects parameters: one-step Newton-Raphson (Fisher scoring) update
- need to average the conditional mean/variance of the outcomes over the MC samples of random effects.
- Variance of random effects dist: sample variance estimate.
Numerical algorithms
- MH sampling algorithm
## MH sampling of random effects | data
## logit Pr(D_i=1|eta_i,U) = eta_i + U;  U ~ N(0,Vu)
## proposal dist: N(Uc,Vc)
U.mh <- function(Di,eta, Vu, Uc,Vc, B=100){
    ub = rep(0, B)
    ## initialize the chain with a draw from the proposal
    ub[1] = rnorm(1)*sqrt(Vc)+Uc
    prb = 1/(1+exp(-eta-ub[1]))
    ## log[target/proposal] at the current value
    llk0 = dnorm(ub[1],sd=sqrt(Vu), log=TRUE) + sum(log(Di*prb+(1-Di)*(1-prb))) -
        dnorm(ub[1],Uc,sqrt(Vc), log=TRUE)
    for(k in 2:B){
        ub[k] = ub[k-1]
        uk = rnorm(1)*sqrt(Vc)+Uc
        prb = 1/(1+exp(-eta-uk))
        llk1 = dnorm(uk,sd=sqrt(Vu), log=TRUE) + sum(log(Di*prb+(1-Di)*(1-prb))) -
            dnorm(uk,Uc,sqrt(Vc), log=TRUE)
        alpha = exp( llk1 - llk0 )
        if(alpha>=1){
            ub[k] = uk; llk0 = llk1
        } else{
            aa = runif(1)
            if(aa<alpha){ ub[k] = uk; llk0 = llk1 }
        }
    }
    return(ub)
}
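- For example, an illustrative call with made-up values (four binary outcomes, linear predictors eta, random-intercept variance Vu, and proposal \(N(Uc,Vc)\)):
## Illustrative call with made-up values (not part of the fitting routine below).
ub = U.mh(Di=c(1,0,1,1), eta=c(0.5,-1,2,0.3), Vu=100, Uc=0, Vc=1, B=1000)
mean(ub[-(1:200)]); var(ub[-(1:200)])   # MC mean/variance of (U | data) after burn-in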
- Compute proposal dist
- we use numerical optimization and numerical second derivatives (via the numDeriv R package).
library(numDeriv)
## proposal dist for one individual: normal matching the mode and curvature of
## the conditional dist of the random intercept given that individual's data
UV.est <- function(Di,eta,Vu,Uc){
    ## negative log of Pr(U) * Pr(Di|U) (normalizing constant not needed)
    llk0 = function(xpar){
        Uc = xpar
        prb = 1/(1+exp(-eta-Uc))
        res = dnorm(Uc,sd=sqrt(Vu), log=TRUE) + sum(log(Di*prb+(1-Di)*(1-prb)))
        -res
    }
    tmp = try(optim(Uc, llk0, method='Brent', lower=Uc-10,upper=Uc+10) )
    if(inherits(tmp, 'try-error')) tmp = optim(Uc, llk0)
    Uc = tmp$par
    Vc = 1/hessian(llk0, Uc)
    c(Uc,Vc)
}
## loop over individuals: proposal mean/variance for each random intercept
UV.mh <- function(Vu,beta,Uc, D,X,subj){
    sid = unique(subj); n = length(sid)
    Vc = rep(0,n)   # keep the input Uc as warm starts for the mode search
    for(i in 1:n){
        ij = which(subj==sid[i])
        Xi = X[ij,,drop=FALSE]
        eta = Xi%*%beta
        zi = UV.est(D[ij],eta,Vu,Uc[i])
        Uc[i] = zi[1]; Vc[i] = zi[2]
    }
    return(list(Uc=Uc,Vc=Vc) )
}
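- For example, an illustrative call with the same made-up values as above returns the proposal mean and variance for one individual:
## Illustrative call: proposal mode and curvature-based variance for one individual.
UV.est(Di=c(1,0,1,1), eta=c(0.5,-1,2,0.3), Vu=100, Uc=0)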
- Newton-Raphson update
- Compute first/second derivatives of the complete data log likelihood
## score and fisher information for the fixed effects, plus E(U^2|data)
SF.mh <- function(Vu,beta,Uc,Vc, D,X,subj){
    sid = unique(subj); n = length(sid)
    p = dim(X)[2]
    S = rep(0, p)
    FI = matrix(0, p,p)
    sig2 = 0
    for(i in 1:n){
        ij = which(subj==sid[i])
        Xi = X[ij,,drop=FALSE]
        eta = Xi%*%beta
        ## MH draws of the random intercept given this individual's data (drop burn-in)
        zi = U.mh(D[ij],eta,Vu,Uc[i],Vc[i], B=5e3)[-(1:1e3)]
        ## MC averages of the conditional mean and variance of the outcomes
        theta = sapply(eta, function(b0) mean(1/(1+exp(-b0-zi))) )
        theta2 = sapply(eta, function(b0) mean(exp(b0+zi)/(1+exp(b0+zi))^2) )
        FI = FI + t(Xi)%*%(theta2*Xi)
        S = S + colSums((D[ij]-theta)*Xi)
        sig2 = sig2 + mean(zi^2)
    }
    return(list(S=S, FI=FI, sig2=sig2/n) )
}
- Iterative estimation
- Start from the glmer() fitted values
- Note MH sampling is more efficient if we update the proposal dist after each iteration (matching mode/curvature).
- Numerically you can try keeping the same proposal dist throughout
- just comment out the two lines of code updating Uc/Vc.
- convergence becomes very slow.
library(lme4)
sid = rep(1:n, K)
a1 = glmer(c(D) ~ Xs + Z[sid] + (1|sid), family=binomial)
## extract variance and fixed effects parameters; + mode/variance of (random effects|data)
Vu = (getME(a1,'theta'))^2; beta = fixef(a1)
Um = ranef(a1, condVar=TRUE)
D = c(D); X = cbind(1,Xs,Z[sid]); subj = sid
Uc = unlist(Um[[1]]); Vc = c( attr(Um[[1]], 'postVar') )
for(b in 1:100){
    ## NR updates with MH sampling
    obj = SF.mh(Vu,beta,Uc,Vc, D,X,subj)
    Vu = obj$sig2
    tmp = solve(obj$FI,obj$S)
    beta = beta + tmp
    ## Proposal dist update
    tmp1 = UV.mh(Vu,beta,Uc, D,X,subj)
    Uc = tmp1$Uc; Vc = tmp1$Vc
    cat(b, ':', tmp, ';', obj$S/n, '\n\t', sqrt(Vu), beta, '\n')
}