This document provides the code for a simple simulation study to assess the effect of the number of covariates in the model on the width of the confidence interval for a single coefficient in a logistic regression model. The document is prepared in R Markdown using RStudio. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents.

R Markdown is useful because it provides a more natural and flexible way to document simulation studies or data analyses than commenting code. For example, R Markdown allows you to include LaTeX in the text, so you can provide more detailed and helpful documentation, as below. For more details on using R Markdown see http://rmarkdown.rstudio.com.

You need to fit a logistic regression model of the form

\[\log(\theta/(1-\theta)) = \gamma_0 + \gamma_1 X_1 + \cdots + \gamma_p X_p\]

where \(X_1\) is the predictor of interest and you are considering adjusting for \(X_2, \dots, X_p\).

Many adjustment variables are available, and you are concerned that estimating many \(\gamma\) parameters will harm your ability to estimate \(\gamma_1\) precisely.

**Question:** How is estimation of \(\gamma_1\) affected by number of covariates \(p\) in the regression model?

- No obvious theoretical results to rely on (existing results solve only special cases or do not apply)
- Solution? Simulate!

A simulation study is a **computer experiment**, usually conducted by **Monte Carlo** (random) **sampling** from probability distributions.

**Two key features:**

- You control the truth! \(\Rightarrow\) can explore many possible "truth" scenarios.
- Computers don't get bored! \(\Rightarrow\) can repeat the experiment thousands of times to learn about the sampling distribution of parameters of interest.

- Decide what quantities/properties you want to investigate
- Decide what "truth" scenarios you want to consider
- Write and optimize computer code for running simulations
- RUN simulations!… get a coffee… or two…
- Collect and summarize results
- Document the process

\[\log(\theta/(1-\theta)) = \gamma_0 + \gamma_1 X_1 + \cdots + \gamma_p X_p\]

What properties of \(\hat \gamma_1\) might we be interested in?

- Bias
- Variance
- Confidence interval coverage
- Size/power of hypothesis tests

In general, a simulation study should investigate different scenarios which show how the properties of interest vary.

For this example, we might vary:

- The number of variables \(p\): 2, 5, 10, 20
- Distribution of \(X_1, \dots, X_p\): Bernoulli, Independent Normal, MVN
- The values of \(\gamma_1, \dots, \gamma_p\): all zero, a few non-zero, most non-zero, all non-zero

(for today’s class, just vary the number of covariates \(p\))

General-purpose simulation algorithm:

- Generate \(S\) independent data sets under the conditions of interest
- Compute the numerical value of an estimator or test statistic \(T\) for each dataset to obtain \(T_1, T_2, \ldots, T_S\)
- Compute a summary statistic (e.g. mean, median) from \(T_1, T_2, \ldots, T_S\) to estimate properties of the quantity of interest.
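As a toy illustration of this general template (a generic sketch estimating the sampling distribution of a sample mean, not the logistic-regression code used below):

```
## Generic skeleton of the algorithm above (toy example, not the study below)
S <- 1000                        ## number of simulated data sets
Tstat <- numeric(S)
for (s in 1:S) {
  set.seed(s)
  y <- rnorm(20, mean = 1)       ## 1. generate one data set under the chosen truth
  Tstat[s] <- mean(y)            ## 2. compute the statistic T for this data set
}
c(mean = mean(Tstat), sd = sd(Tstat))  ## 3. summarize T_1, ..., T_S
```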

R Markdown allows you to include code in the document and view its output, like so:

```
## Set the initial conditions
Ps <- c(2,5,10,20) ## Number of covariates
N <- 100 ## Total sample size
nsim <- 1000 ## Number of simulations
## Create a master vector of length max(Ps) which we will subset for different scenarios
gamma.all <- c(1,0.5,0.2,rep(0,17))
print(gamma.all)
```

```
## [1] 1.0 0.5 0.2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
## [18] 0.0 0.0 0.0
```

The simulation is carried out by the `doSim` function:

```
## Define the simulation function
doSim <- function(seed, N, gamma) {
  set.seed(seed) ## Set the random seed for this simulation
  p <- length(gamma) ## Number of covariates in the model
  ## Generate the covariate matrix: N rows of p independent N(0,1) draws
  X <- t(sapply(1:N, function(x) { rnorm(p) }))
  ## Generate the outcome vector from a logistic model
  Y <- rbinom(N, 1, prob = exp(X %*% gamma)/(1 + exp(X %*% gamma)))
  ## Fit a logistic model
  GLM <- glm(Y ~ X, family = binomial)
  ## Return a vector with the estimated coefficient and CI for the first covariate
  return(c(coef(GLM)[2], coef(GLM)[2] + c(-1, 1)*1.96*summary(GLM)$coef[2, 2])) ## This has length 3
}
```

In the `doSim` function, the covariate matrix \(X\) is created by stacking \(N\) random \(N(0,1)\) vectors of length \(p\). The `sapply` command applies a function to a set; here, the set is the numbers \(1, 2, \dots, N\). The function returns a vector of \(p\) random \(N(0,1)\) values (the covariate values). The `x` argument is just a placeholder and isn't used in the function. The resulting matrix is transposed by the `t()` command to get the right shape.
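A small numeric illustration (with arbitrary dimensions, not those of the study) shows why the transpose is needed:

```
## Without t(), sapply() binds the length-p vectors together as columns;
## transposing gives an N x p matrix, one row per observation
X <- t(sapply(1:4, function(x) { rnorm(3) }))
dim(X)  ## 4 rows (observations), 3 columns (covariates)
```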

When `X` is a matrix, the formula `Y~X` is interpreted as `Y~X[,1]+X[,2]+...+X[,p]`, i.e., a main effect for each column of the matrix. There is no need to define an intercept term, as one is included by default.

`coef(GLM)` yields the vector of coefficients from the model fit; the intercept is the first element, so we are interested in the second element. `summary(GLM)$coef` returns the table of coefficients, SEs, and p-values; the `[2,2]` element gives the standard error of interest, and combining it with the coefficient estimate we can compute the CI as in the `doSim` function.
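The same extraction on a small toy fit (a sketch with made-up data, not part of the study) looks like this:

```
## Toy logistic fit to illustrate the indexing
set.seed(42)
x <- rnorm(50)
y <- rbinom(50, 1, prob = exp(x)/(1 + exp(x)))
fit <- glm(y ~ x, family = binomial)
est <- coef(fit)[2]              ## slope estimate; element 1 is the intercept
se  <- summary(fit)$coef[2, 2]   ## its standard error from the coefficient table
est + c(-1, 1) * 1.96 * se       ## 95% Wald confidence interval
```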

Note the line `set.seed(seed)` in the `doSim` function. This guarantees that, if everything else stays the same, re-running the code will produce exactly the same result. This can help you identify when errors are occurring, and also allows other users to verify that your code is giving a consistent answer.
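A quick demonstration of what seeding buys you:

```
set.seed(123)
a <- rnorm(3)
set.seed(123)
b <- rnorm(3)
identical(a, b)  ## TRUE: resetting the seed reproduces the same draws
```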

```
## Note the use of 'cache = TRUE' here to speed up compiling of the document.
## This option runs the code in this block once and stores the results,
## then uses those results instead of re-running the code unless the content of the block changes.
## Define a summary table to hold the results
table.summ <- matrix(nrow = length(Ps), ncol = 4)
for (i in 1:length(Ps)) {
  ## Define the number of covariates
  p <- Ps[i]
  ## Define the coefficient vector by subsetting the 'master' vector
  gamma <- gamma.all[1:p]
  ## Do the simulation with these settings
  sim.results <- t(sapply(1:nsim, doSim, N = N, gamma = gamma))
  ## Summarize the simulation results
  bias <- mean(sim.results[, 1]) - gamma[1]
  coverage <- mean(sim.results[, 2] < gamma[1] & sim.results[, 3] > gamma[1])
  CIwidth <- mean(sim.results[, 3] - sim.results[, 2])
  ## Put the summaries in a table
  table.summ[i, ] <- c(p, bias, coverage, CIwidth)
}
```

The above code actually performs the simulation under the different scenarios.

Note that the line `sim.results <- t(sapply(1:nsim,doSim,N=N,gamma=gamma))` uses a slightly different syntax for `sapply()`. Here, we have already defined the `doSim()` function, which takes three arguments: the simulation number (used as a random seed), `N`, and `gamma`. `sapply()` feeds the simulation numbers `1, 2, ..., nsim` to `doSim()`, along with additional arguments for the sample size (`N`) and coefficient vector (`gamma`). The result of each call to `doSim()` is a vector of length 3 containing the estimated coefficient and CI. As above, we transpose the resulting matrix; this time, it's more for convenience.
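The extra-argument mechanism can be seen in a one-liner:

```
## Named arguments placed after the function are passed along to every call
sapply(1:3, function(i, a) i + a, a = 10)
## [1] 11 12 13
```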

Designing **clear** tables for presenting (non-trivial) simulation results is **hard!** Some rules of thumb:

- Simulation settings on rows (no more than 8-10), summary stats on columns (no more than 6-8)
- Use descriptive names whenever feasible:
- For columns, "Bias" is preferable to \(E(\hat{\gamma}_1 - \gamma_1)\)
- For rows, "Weakly informative prior" is better than \(\gamma_1 \sim N(0,0.1)\)
- (Parameter values can be put in a separate table, possibly in appendix/supplementary materials)

- Plots are often preferable to tables, and **always preferable in talk slides**.

R Markdown allows you to produce nice tables (that also work in MS Word!). In the `kable` function, `digits` specifies the number of digits to print for each column (the first '0' is for the row names, which we don't actually print out), `align` gives the alignment of each column, and `row.names = FALSE` tells the command to omit the row names when printing the table.

```
## Give the columns of the results table friendly names
colnames(table.summ) <- c("p","Bias","Coverage","CI width")
knitr::kable(table.summ, digits = c(0, 0, 3, 2, 2),
align = c('l', 'c', 'c', 'c', 'c'), row.names = FALSE)
```

p | Bias | Coverage | CI width
---|---|---|---
2 | 0 | 0.952 | 1.11
5 | 0 | 0.951 | 1.15
10 | 0 | 0.931 | 1.25
20 | 0 | 0.864 | 1.56

It's also possible to get output that can be pasted into LaTeX documents using `xtable`:

```
#### Print the table
## Write out a summary table in LaTeX
library(xtable) ## Load a required library
print(xtable(table.summ,digits=c(0,0,3,2,2),align="lcccc"),include.rownames=FALSE)
```

```
## % latex table generated in R 3.2.5 by xtable 1.8-2 package
## % Sat Sep 10 12:31:15 2016
## \begin{table}[ht]
## \centering
## \begin{tabular}{cccc}
## \hline
## p & Bias & Coverage & CI width \\
## \hline
## 2 & 0.067 & 0.95 & 1.11 \\
## 5 & 0.083 & 0.95 & 1.15 \\
## 10 & 0.182 & 0.93 & 1.25 \\
## 20 & 0.424 & 0.86 & 1.56 \\
## \hline
## \end{tabular}
## \end{table}
```

Another neat feature is that you can use R to calculate and insert values into the text. For example, the values in the following sentence will change if the simulation settings do: the smallest confidence interval width was 1.1092442 and the largest was 1.5616845. You can also get them to change if you comment out the line `set.seed(seed)`.

R Markdown allows you to embed plots. For example:

Note that the code to generate the plot is not shown here; it is in the source document, but the `echo = FALSE` parameter was added to the code chunk to prevent printing of the R code that generated the plot.

- Sometimes, simulations will fail (\(p \approx n\) regression scenarios, survival data with no failures if failure is rare, etc.).
- Specify how you will handle iterations that fail, and keep track of proportion of failed iterations.
- If failures are relatively uncommon (say \(\approx 5\%\)), can eliminate them from summary statistics and note that you have done so.
- Otherwise, reasons for failure should be carefully investigated.
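One common way to implement this bookkeeping (a sketch; the `safeSim` wrapper is an assumption, not part of the code above) is to trap failures with `tryCatch()` and return `NA`s:

```
## Hypothetical wrapper around doSim() that records failures as NAs
safeSim <- function(seed, N, gamma) {
  tryCatch(doSim(seed, N, gamma),
           error = function(e) rep(NA, 3))  ## failed fit -> length-3 NA vector
}
## After the runs, the proportion of failed iterations is
## mean(is.na(sim.results[, 1]))
```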

- Can perform a "sample size calculation" to see how many you need for desired precision (see Burton et al. (2006) on course web page).
- But often, in practice… as many as you can do before the manuscript has to be submitted/revised!
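For a coverage estimate, this "sample size calculation" can be sketched with the usual binomial standard error formula, assuming independent simulations (consistent with the approach in Burton et al. (2006)):

```
## How many simulations S are needed so that an estimated coverage near 0.95
## has Monte Carlo standard error at most 0.005?  SE = sqrt(p(1-p)/S)
p.true <- 0.95
target.se <- 0.005
S <- ceiling(p.true * (1 - p.true) / target.se^2)
S  ## 1900 simulations
```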

Conduct a simple simulation study and summarize the results in a short (\(\leq 1\) page) R Markdown document compiled to HTML, Word, or PDF. You may either select your own topic, or use one of the topics provided below. Assignments should be turned in one week from today, i.e. on **September 21**.

Epidemiologists like to categorize continuous exposures; here you will use simulation to evaluate the effect of categorization on power.

Consider the model \(y_i=\beta_0+\beta_1x_i+\varepsilon_i\), where \(\varepsilon_i \sim N(0,1)\) and \(\beta_0 = 0\). You will generate data for a sample size of \(n=100\) from this model, for various values of \(\beta_1\), when \(x\) is recorded as follows:

- a linear term in \(x_i\) is used as a predictor (a 1 df test for the \(x_i\) effect).
- three indicator variables for the \(4^{th}\), \(3^{rd}\), and \(2^{nd}\) quartiles of \(x_i\) are used as predictors (a 3 df test of the \(x_i\) effect).

Report the Type I error for \(H_0: \beta_1 = 0\) vs. \(H_1: \beta_1 \neq 0\) when \(\beta_1=0\) and the power when \(\beta_1 = 0.1, 0.5\), and 1.

**Optional challenge:** Produce a *power curve* showing how the power varies with the size of \(\beta_1\) (you may want to simulate at more values of \(\beta_1\) for this).
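One way to build the quartile indicators (a sketch only; the `cut()` arguments are one reasonable choice, not prescribed by the exercise):

```
## Categorize x into quartiles; factor(xq) then yields 3 indicator variables
x  <- rnorm(100)
xq <- cut(x, breaks = quantile(x, probs = seq(0, 1, 0.25)),
          include.lowest = TRUE, labels = FALSE)
table(xq)  ## 25 observations per quartile
## anova(lm(y ~ factor(xq))) then gives the 3 df F-test of the x effect
```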

In this exercise, you will investigate the Type I error and power of the t-test (assuming equal variances) under homoscedasticity and heteroscedasticity.

Suppose that you have two groups with values generated from \(N(\mu_1, \sigma^2_1)\) and \(N(\mu_2, \sigma^2_2)\), respectively. Generate data for groups of size \(n=10\) and evaluate the (two-sided) Type I error and power of the (equal variances) t-test for the following scenarios:

- \(\mu_1 = \mu_2 = 0\), \(\sigma^2_1 = 1, \sigma^2_2 = 1, 5, 25\)
- \(\mu_1 = 0, \mu_2 = 1, 2, 3\), \(\sigma^2_1 =1, \sigma^2_2 = 1,5,25\)

NOTE: By default, R uses the **unequal-variances** t-test, so you will have to use `t.test(..., var.equal=TRUE)` for this exercise.
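For one scenario, the Type I error estimate might be computed like this (a sketch with arbitrary settings, not a prescribed solution):

```
## Estimated Type I error of the equal-variance t-test when the
## variances actually differ (mu1 = mu2 = 0, var1 = 1, var2 = 25)
set.seed(1)
pvals <- replicate(1000, {
  g1 <- rnorm(10, mean = 0, sd = 1)
  g2 <- rnorm(10, mean = 0, sd = 5)
  t.test(g1, g2, var.equal = TRUE)$p.value
})
mean(pvals < 0.05)  ## proportion of false rejections at the 5% level
```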

**Optional challenge:** Repeat this procedure using the t-test with unequal variances, and compare the results.