This document provides the code for a simple simulation study to assess the effect of the number of covariates in the model on the width of the confidence interval for a single coefficient in a logistic regression model. The document is prepared in R Markdown using RStudio. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents.

R Markdown is useful because it provides a more natural and flexible way than commenting code to document what you are doing in simulation studies or data analyses. For example, R Markdown allows you to include LaTeX in the text, so you can provide more detailed and helpful documentation, as below. For more details on using R Markdown see http://rmarkdown.rstudio.com.


Motivating Example

You need to fit a logistic regression model of the form

\[\log(\theta/(1-\theta)) = \gamma_0 + \gamma_1 X_1 + \cdots + \gamma_p X_p\]

where \(X_1\) is the predictor of interest and you are considering adjusting for \(X_2, \dots, X_p\).

Many adjustment variables are available, and you are concerned that estimating many \(\gamma\) parameters will harm your ability to estimate \(\gamma_1\) precisely.

Question: How is estimation of \(\gamma_1\) affected by number of covariates \(p\) in the regression model?


What is a simulation study?

A simulation study is a computer experiment, usually conducted by Monte Carlo (random) sampling from probability distributions.

Two key features:

  1. You control the truth! \(\Rightarrow\) can explore many possible "truth" scenarios.
  2. Computers don’t get bored! \(\Rightarrow\) can repeat experiment thousands of times to learn about sampling distribution of parameters of interest.

Simulation study steps

  1. Decide what quantities/properties you want to investigate
  2. Decide what "truth" scenarios you want to consider
  3. Write and optimize computer code for running simulations
  4. RUN simulations!… get a coffee… or two…
  5. Collect and summarize results
  6. Document the process

Step 1: What to estimate?

\[\log(\theta/(1-\theta)) = \gamma_0 + \gamma_1 X_1 + \cdots + \gamma_p X_p\]

What properties of \(\hat \gamma_1\) might we be interested in?

  • Bias
  • Variance
  • Confidence interval coverage
  • Size/power of hypothesis tests

Step 2: What to vary?

In general, a simulation study should investigate different scenarios which show how the properties of interest vary.

For this example, we might vary:

  • The number of variables \(p\): 2, 5, 10, 20
  • Distribution of \(X_1, \dots, X_p\): Bernoulli, Independent Normal, MVN
  • The values of \(\gamma_1, \dots, \gamma_p\): all zero, a few non-zero, most non-zero, all non-zero

(for today’s class, just vary the number of covariates \(p\))


Step 3: Write and optimize computer code

General-purpose simulation algorithm:

  • Generate \(S\) independent data sets under the conditions of interest
  • Compute the numerical value of an estimator or test statistic \(T\) for each dataset to obtain \(T_1, T_2, \ldots, T_S\)
  • Compute a summary statistic (e.g. mean, median) from \(T_1, T_2, \ldots, T_S\) to estimate properties of quantity of interest.
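As a toy illustration of this recipe — the functions gen_data() and compute_T() below are hypothetical placeholders standing in for a problem-specific data generator and estimator, not part of the logistic regression example that follows:

## Generic skeleton of the algorithm above (gen_data() and compute_T() are illustrative placeholders)
gen_data <- function(s) { set.seed(s); rnorm(50, mean = 1) } ## e.g. one sample of size 50
compute_T <- function(dat) { mean(dat) }                     ## e.g. the sample mean
S <- 500
T.values <- sapply(1:S, function(s) compute_T(gen_data(s)))  ## T_1, ..., T_S
c(mean = mean(T.values), sd = sd(T.values))                  ## summarize the sampling distribution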

Simulation code

R Markdown allows you to include code in the document and view its output, like so:


Setting up the simulation

## Set the initial conditions
Ps <- c(2,5,10,20) ## Number of covariates
N <- 100 ## Total sample size
nsim <- 1000 ## Number of simulations
## Create a master vector of length max(Ps) which we will subset for different scenarios
gamma.all <- c(1,0.5,0.2,rep(0,17)) 

print(gamma.all)
##  [1] 1.0 0.5 0.2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
## [18] 0.0 0.0 0.0

The doSim function

## Define the simulation function
doSim <- function(seed,N,gamma) {
  set.seed(seed) ## Set the random seed for this simulation
  p <- length(gamma) ## Number of covariates in the model
  
  X <- t(sapply(1:N,function(x) { rnorm(p) }))

  ## Generate the outcome vector from a logistic model
  Y <- rbinom(N,1,prob=exp(X%*%gamma)/(1 + exp(X%*%gamma)))
  
  #### Fit a logistic model
  GLM <- glm(Y~X,family=binomial)
  
  #### Return a vector with the estimated coefficient and CI for the first covariate
  return(c(coef(GLM)[2],coef(GLM)[2]+c(-1,1)*1.96*summary(GLM)$coef[2,2])) ## This has length 3
}

In the doSim function, the covariate matrix \(X\) is created by stacking N random \(N(0,1)\) vectors of length \(p\). The sapply command applies a function to each element of a set; here, the set is the numbers \(1, 2, \dots, N\), and the function returns a vector of \(p\) random \(N(0,1)\) covariate values. The x argument is just a placeholder and isn’t used inside the function. Because sapply returns these vectors as the columns of a \(p \times N\) matrix, the result is transposed with the t() command to get the desired \(N \times p\) shape.
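An equivalent and arguably more direct way to build a covariate matrix of the same form (shown only for comparison; this is not the code used in doSim()):

## Equivalent alternative: fill an N x p matrix with N(0,1) draws directly
N <- 100; p <- 5
X.alt <- matrix(rnorm(N * p), nrow = N, ncol = p)
dim(X.alt) ## 100 x 5, the same shape as t(sapply(1:N, function(x) rnorm(p)))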

When X is a matrix, the formula Y~X is interpreted as Y~X[,1]+X[,2]+...+X[,p], i.e., a main effect for each column of the matrix. There is no need to define an intercept term, as it is included by default.

coef(GLM) yields the vector of estimated coefficients from the model fit; the intercept is the first element, so the coefficient of interest (\(\hat\gamma_1\)) is the second element. summary(GLM)$coef returns the table of coefficients, SEs, and p-values; the [2,2] element gives the standard error of interest, and combining this with the coefficient estimate gives the Wald confidence interval computed in the return statement above.


Setting the random seed

Note the line set.seed(seed) in the doSim function. This guarantees that, if everything else stays the same, re-running the code will produce the exact same result. This can help you identify when errors are occurring, and also allows other users to verify that your code is giving a consistent answer.
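A quick illustration of what set.seed() buys you (the seed value 42 is arbitrary):

## The same seed produces the same 'random' numbers every time
set.seed(42); rnorm(3)
set.seed(42); rnorm(3) ## identical to the line above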


Performing the simulation

## Note the use of 'cache = TRUE' here to speed up compiling of the document. 
## This option runs the code in this block once and stores the results, 
## and uses those results instead of re-running the code unless the content of the block changes.

## Define a summary table to hold the results
table.summ <- matrix(nrow=length(Ps),ncol=4)

for(i in 1:length(Ps)) {
  ## Define the number of covariates
  p <- Ps[i]
  
  ## Define the coefficient vector by subsetting the 'master' vector
  gamma <- gamma.all[1:p]
  
  #### Do the simulation with these settings
  sim.results <- t(sapply(1:nsim,doSim,N=N,gamma=gamma))
  
  ## Summarize the simulation results
  bias <- mean(sim.results[,1]) - gamma[1]
  coverage <- mean( sim.results[,2] < gamma[1] & sim.results[,3] > gamma[1])
  CIwidth <- mean( sim.results[,3] - sim.results[,2])
  
  ## Put the summaries in a table
  table.summ[i,] <- c(p,bias,coverage,CIwidth)
}

The above code actually performs the simulation under the different scenarios.

Note that the line sim.results <- t(sapply(1:nsim,doSim,N=N,gamma=gamma)) uses a slightly different syntax for sapply(); here, we have already defined the doSim() function, which takes three arguments: the simulation number (used as the random seed), N, and gamma. sapply() feeds the simulation numbers 1, 2, ..., nsim to doSim(), along with the additional arguments for the sample size (N) and coefficient vector (gamma). Each call to doSim() returns a vector of length 3 containing the estimated coefficient and CI, so sapply() produces a 3 x nsim matrix; as above, we transpose it — this time mainly for convenience, so that each row corresponds to one simulation.
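If this sapply() syntax is unfamiliar, the following for loop is an equivalent way of producing the same sim.results matrix:

## Equivalent for-loop version of the sapply() call above
sim.results <- matrix(nrow = nsim, ncol = 3)
for (s in 1:nsim) {
  sim.results[s, ] <- doSim(s, N = N, gamma = gamma)
}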


Writing out the results

Advice on presenting simulation results

Designing clear tables for presenting (non-trivial) simulation results is hard! Some rules of thumb:

  • Simulation settings on rows (no more than 8-10), summary stats on columns (no more than 6-8)
  • Use descriptive names whenever feasible:
    • For columns, "Bias" is preferable to \(E(\hat{\gamma}_1 - \gamma_1)\)
    • For rows, "Weakly informative prior" is better than \(\gamma_1 \sim N(0,0.1)\)
    • (Parameter values can be put in a separate table, possibly in appendix/supplementary materials)
  • Plots are often preferable to tables, always preferable in talk slides.

Producing tables

R Markdown allows you to produce nice tables (that also work in MS Word!). In the kable function, digits specifies the number of digits to print for each column (the elements are matched to the data columns in order, which is why Bias is rounded to zero digits in the table below), align gives the alignment of each column, and row.names = FALSE tells the command to omit the row names when printing the table.

## Give the columns of the results table friendly names
colnames(table.summ) <- c("p","Bias","Coverage","CI width")

knitr::kable(table.summ, digits = c(0, 0, 3, 2, 2), 
             align = c('l', 'c', 'c', 'c', 'c'), row.names = FALSE)
p    Bias   Coverage   CI width
2    0      0.952      1.11
5    0      0.951      1.15
10   0      0.931      1.25
20   0      0.864      1.56

It’s also possible to get output that can be pasted into LaTeX documents using xtable:

#### Print the table
## Write out a summary table in LaTeX
library(xtable) ## Load a required library
print(xtable(table.summ,digits=c(0,0,3,2,2),align="lcccc"),include.rownames=FALSE)
## % latex table generated in R 3.2.5 by xtable 1.8-2 package
## % Sat Sep 10 12:31:15 2016
## \begin{table}[ht]
## \centering
## \begin{tabular}{cccc}
##   \hline
## p & Bias & Coverage & CI width \\ 
##   \hline
## 2 & 0.067 & 0.95 & 1.11 \\ 
##   5 & 0.083 & 0.95 & 1.15 \\ 
##   10 & 0.182 & 0.93 & 1.25 \\ 
##   20 & 0.424 & 0.86 & 1.56 \\ 
##    \hline
## \end{tabular}
## \end{table}

Another neat feature is that you can use R to calculate and insert values into the text. For example, the values in the following sentence will change if the simulation settings do: the smallest confidence interval width was 1.1092442 and the largest was 1.5616845. You can also get them to change if you comment out the line set.seed(seed).
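In the .Rmd source these numbers are not typed by hand; they are inserted with inline R code. The exact expressions used in the source are not shown here, but (assuming table.summ as defined above) they would look something like this:

the smallest confidence interval width was `r min(table.summ[,4])` and the largest was `r max(table.summ[,4])`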


Producing plots

R Markdown allows you to embed plots. For example:

Note that the code to generate the plot is not shown here; it is in the source document, but the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot.
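The plotting code is therefore not reproduced here; a minimal sketch that would produce a similar figure from table.summ (a guess at the hidden chunk, not necessarily the original code) is:

## Plot average CI width against the number of covariates p
plot(table.summ[, 1], table.summ[, 4], type = "b", pch = 19,
     xlab = "Number of covariates p", ylab = "Average CI width")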


Other thoughts

Dealing with failure(s)

  • Sometimes, simulations will fail (\(p \approx n\) regression scenarios, survival data with no failures if failure is rare, etc.).
  • Specify how you will handle iterations that fail, and keep track of the proportion of failed iterations (one way to do this is sketched after this list).
  • If failures are relatively uncommon (say \(\approx 5\%\)), can eliminate them from summary statistics and note that you have done so.
  • Otherwise, reasons for failure should be carefully investigated.
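One common way to handle failures — shown here only as an illustration, not as part of the doSim() simulation above — is to wrap the model fit in tryCatch() so that a failed iteration returns NA rather than stopping the whole run; the proportion of NAs can then be reported:

## Illustration: return NA (instead of stopping) when a model fit fails
safeFit <- function(Y, X) {
  tryCatch(coef(glm(Y ~ X, family = binomial))[2],
           error = function(e) NA)
}
## After running all iterations you could report, e.g. (est is a hypothetical vector of stored results):
## mean(is.na(est))          ## proportion of failed iterations
## mean(est, na.rm = TRUE)   ## summary over the successful iterations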

How many simulations do you need?

  • Can perform a "sample size calculation" to see how many simulations you need for the desired precision (see Burton et al. (2006) on the course web page); a rough version of such a calculation is sketched below.
  • But often, in practice… as many as you can do before the manuscript has to be submitted/revised!
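As a rough illustration of such a calculation (the target precision of ±0.01 is just an example): to estimate a coverage probability near 0.95 to within about ±0.01, solve the usual binomial standard-error formula for the number of simulations:

## How many simulations to estimate coverage ~0.95 within +/- 0.01 (95% CI half-width)?
p.cov <- 0.95       ## anticipated coverage
halfwidth <- 0.01   ## desired precision (example value)
nsim.needed <- ceiling(p.cov * (1 - p.cov) / (halfwidth / 1.96)^2)
nsim.needed ## roughly 1825 simulations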

Assignment

Conduct a simple simulation study and summarize the results in a short (\(\leq 1\) page) R Markdown document compiled to HTML, Word, or PDF. You may either select your own topic, or use one of the topics provided below. Assignments should be turned in one week from today, i.e. on September 21.


Assignment option 1

Epidemiologists like to categorize continuous exposures; here you will use simulation to evaluate the effect of categorization on power.

Consider the model \(y_i=\beta_0+\beta_1x_i+\varepsilon_i\), where \(\varepsilon_i \sim N(0,1)\) and \(\beta_0 = 0\). You will generate data for a sample size of \(n=100\) from this model, for various values of \(\beta_1\), and analyze the effect of \(x_i\) in two ways:

  • a linear term in \(x_i\) is used as a predictor (a 1 df test for the \(x_i\) effect).
  • three indicator variables for the \(4^{th}\), \(3^{rd}\), and \(2^{nd}\) quartiles of \(x_i\) are used as predictors (a 3 df test of the \(x_i\) effect).

Report the Type I error for \(H_0: \beta_1 = 0\) vs. \(H_1: \beta_1 \neq 0\) when \(\beta_1=0\) and the power when \(\beta_1 = 0.1, 0.5\), and 1.
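If it helps, here is one possible way to set up a single replicate of the two analyses (the choice of \(x_i \sim N(0,1)\) and the use of cut() and anova() are suggestions, not required parts of the assignment):

## One possible setup for a single replicate (illustration only)
n <- 100; beta1 <- 0.5
x <- rnorm(n)                                   ## the distribution of x is up to you
y <- 0 + beta1 * x + rnorm(n)
x.quart <- cut(x, breaks = quantile(x, probs = seq(0, 1, 0.25)),
               include.lowest = TRUE)           ## 4 quartile groups
p.linear <- summary(lm(y ~ x))$coef[2, 4]       ## 1 df test of the x effect
p.categ  <- anova(lm(y ~ x.quart))[1, "Pr(>F)"] ## 3 df test across quartile groups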

Optional challenge: Produce a power curve showing how the power varies with the size of \(\beta_1\) (you may want to simulate at more values of \(\beta_1\) for this).


Assignment option 2

In this exercise, you will investigate the Type I error and power of the t-test (assuming equal variances) under homoscedasticity and heteroscedasticity.

Suppose that you have two groups with values generated from \(N(\mu_1, \sigma^2_1)\) and \(N(\mu_2, \sigma^2_2)\), respectively. Generate data for groups of size \(n=10\) and evaluate the (two-sided) Type I error and power of the (equal variances) t-test for the following scenarios:

  • \(\mu_1 = \mu_2 = 0\), \(\sigma^2_1 = 1, \sigma^2_2 = 1, 5, 25\)
  • \(\mu_1 = 0, \mu_2 = 1, 2, 3\), \(\sigma^2_1 =1, \sigma^2_2 = 1,5,25\)

NOTE: By default, R uses the unequal-variances t-test, so you will have to use t.test(..., var.equal=TRUE) for this exercise.
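For reference, a single replicate of one scenario might look like the following (the particular values \(\mu_2 = 1\) and \(\sigma^2_2 = 5\) are just for illustration):

## One replicate under mu1 = 0, mu2 = 1, var1 = 1, var2 = 5 (illustrative values)
n <- 10
group1 <- rnorm(n, mean = 0, sd = 1)
group2 <- rnorm(n, mean = 1, sd = sqrt(5))        ## note: rnorm() takes the SD, not the variance
p.value <- t.test(group1, group2, var.equal = TRUE)$p.value
p.value ## reject at the 5% level if this is below 0.05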

Optional challenge: Repeat this procedure using the t-test with unequal variances, and compare the results.