On testing genetic association with multiple traits


R package development

  • We developed an R package, MSKAT, for testing rare variant set association with multiple phenotypes.
    • The current implementation requires the davies() function in the CompQuadForm R package.
    • A variation of multivariate kernel machine regression (MKMR) is adapted to rare variant set association testing.
      • Ref: Maity,A., Sullivan,P.F., Tzeng,J. (2012) Multivariate Phenotype Association Analysis by Marker-Set Kernel Machine Regression. Genetic Epidemiology, 36, 686-695.
      • MKMR was originally developed for common variants and relied on resampling to compute p-values.
      • We wrap the authors' posted R code (http://www4.stat.ncsu.edu/~maity/software/Multivariate_Response_KMTest.txt) into an R function 'MKMR()', and add a (default) option to compute analytical p-values more efficiently using the Davies method instead of the original resampling approach.
    • Two multivariate versions of SKAT are implemented.
      • Ref: Wu,B., Pankow,J.S. (2016). Sequence kernel association test of multiple continuous phenotypes. Genetic Epidemiology, 40(2), 91-100.
    • The Gene Association with Multiple Traits (GAMuT) is also implemented.
      • Ref: Broadaway, K.A., Cutler, D.J., Duncan, R., Moore, J.L., Ware, E.B., Jhun, M.A., Bielak, L.F., Zhao, W., Smith, J.A., Peyser, P.A., Kardia, S.L.R., Ghosh, D., Epstein, M.P. (2016). A Statistical Approach for Testing Cross-Phenotype Effects of Rare Variants. The American Journal of Human Genetics, 98, 525–540.
      • We wrap the authors' posted R code (http://genetics.emory.edu/labs/epstein/software/gamut) into an R function 'gamut()'.
      • We also developed an alternative implementation, wrapped into an R function 'fgamut()', that is several orders of magnitude faster than gamut().
  • MSKAT R package

Several comments

1. MSKAT vs GAMuT

  • GAMuT with the projection matrix kernel roughly corresponds to the \(Q\) statistic in MSKAT.
  • GAMuT with the linear kernel roughly corresponds to the \(Q'\) statistic in MSKAT.

2. Computing the p-value for the variant set association test

  • The problem reduces to calculating the tail probability of a weighted sum of 1-DF chi-square random variables.
    • The default computation approaches implemented in GAMuT do not give accurate results at small significance levels.
    • GAMuT has very liberal type I errors at extreme significance levels.
  • We adopt a numerical approach to accurately compute the analytical p-values for the variant set association test.
    • The implementation requires the davies() function in the CompQuadForm R package.
    • Main idea: use the Davies method with a more stringent convergence criterion; otherwise the computed p-values can be very liberal (see the sketch below).
    • Ref: Wu,B., Guan,W., Pankow,J.S. (2016) On efficient and accurate calculation of significance p-values for sequence kernel association test of variant set. Annals of human genetics, 80(2), 123-135.
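  • A minimal sketch of this idea using davies() from CompQuadForm; the weights, statistic value, and convergence settings below are illustrative assumptions, not values taken from any package. It computes the tail probability of \(\sum_j\lambda_j\chi^2_1\) with the default and with a more stringent convergence criterion.
library(CompQuadForm)
## illustrative eigenvalue weights and observed statistic for sum_j lambda_j*chisq_1
lambda = c(5, 2, 1, 0.5, 0.1)
Q.obs = 60
## default settings (lim=1e4, acc=1e-4) can be too loose for very small p-values
p.default = davies(Q.obs, lambda)$Qq
## more stringent criterion: larger lim and much smaller acc; also check $ifault for convergence
p.strict = davies(Q.obs, lambda, lim=1e6, acc=1e-9)$Qq
c(p.default, p.strict)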

3. Testing general (non-continuous) outcomes

  • By central limit theorem (CLT) arguments, we can roughly treat general outcomes as continuous and apply the multi-trait testing methods; the test statistics and their weighted sum of 1-DF chi-square null distributions then hold asymptotically.
  • The asymptotic approximation may not be very accurate for finite sample sizes. For example, for binary data
    • The Bernoulli variance is \(\theta(1-\theta)\), where \(\theta\) is the event probability.
    • GAMuT approximates the Bernoulli variance with a constant normal variance.
    • For binary traits with a narrow range of \(\theta\), the GAMuT approximation can work well.
    • Binary data with extreme event probabilities \(\theta\) can have widely varying variances, and type I error inflation can occur.
  • Generally we need to be cautious when applying the method to non-continuous outcomes.
    • More computationally intensive resampling/permutation p-values should be computed for verification (see the sketch below).
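  • A minimal permutation-check sketch. Assumptions: the MSKAT() call used elsewhere in this document returns a numeric vector of p-values, genotypes are independent of the covariates so that permuting genotype rows gives a valid null, and B is illustrative (far larger values are needed to verify very small p-values).
permCheck <- function(Y, G, X, B=1e3, W.beta=c(1,1)){
  obj = MSKAT.cnull(Y, X)
  p.obs = MSKAT(obj, G, W.beta=W.beta)
  cnt = rep(0, length(p.obs))
  for(b in 1:B){
    Gp = G[sample(nrow(G)), , drop=FALSE]   ## permute individuals' genotype rows
    cnt = cnt + (MSKAT(obj, Gp, W.beta=W.beta) <= p.obs)
  }
  (cnt+1)/(B+1)   ## permutation p-values (with the usual +1 correction)
}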

4. Effects of adjusting for covariates

  • The modeling framework of MSKAT offers a more unified approach to naturally accounting for covariates (especially ancestry).
  • As a result, MSKAT can have better finite-sample performance than GAMuT: it appropriately accounts for covariate estimation when computing the asymptotic distributions.
    • GAMuT typically uses the unadjusted genotype covariance matrix.
    • MSKAT appropriately uses the covariate-adjusted genotype covariance matrix, which is generally smaller than the unadjusted covariance matrix (i.e., has smaller eigenvalues; see the sketch below).
    • So MSKAT will have smaller p-values than GAMuT under population stratification.
      • MSKAT offers more improvements under strong population differences.
  • Computationally MSKAT is much more efficient than GAMuT and scales very well to genome and exome-wide analyses.
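  • A small numerical sketch of the eigenvalue point (the sample size, variant count, covariates, and toy genotype model below are illustrative assumptions): residualizing the genotypes on the covariates can only shrink the eigenvalues of the genotype covariance relative to centering alone.
set.seed(1)
n = 1000; m = 20
X = cbind(1, rnorm(n), rbinom(n,1,0.5))    ## covariates including an intercept
G = matrix(rbinom(n*m, 2, 0.01), n,m)      ## toy genotype matrix (no LD)
V.unadj = crossprod(scale(G, center=TRUE, scale=FALSE))  ## centered only (unadjusted)
V.adj = crossprod(qr.resid(qr(X), G))                    ## covariate-adjusted
## differences are nonnegative (up to numerical error): adjusted eigenvalues are no larger
summary(eigen(V.unadj, only.values=TRUE)$values - eigen(V.adj, only.values=TRUE)$values)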

Simulate rare variant set

  • Simulate variants in a selected gene region based on the multivariate normal distribution (MVN) with compound symmetry covariance matrix \(\Sigma\).
    • Assume marginal variance is one and the pairwise correlation is \(\rho\): \(\Sigma_{jj}=1, \Sigma_{kj}=\rho, \forall k\neq j\).
    • Assume marginal MAFs \((\theta_1,\ldots,\theta_m)\) for the \(m\) variants.
      • compute quantiles \(q_j=\Phi^{-1}(\theta_j)\).
      • simulate \(Z_1\sim N(0,\Sigma)\) and \(Z_2\sim N(0,\Sigma)\) (independently).
        • \(Z_1=(z_{11},\ldots,z_{1m}), Z_2=(z_{21},\ldots,z_{2m})\)
      • construct the j-th variant genotype as \(g_j = I(z_{1j}< q_j)+I(z_{2j}< q_j)\).
    • The typical approach is to apply SVD to \(\Sigma\) to compute \(\Sigma^{1/2}\), which is then multiplied by independent standard normal random variables to simulate \(Z\) from \(N(0,\Sigma)\).
      • not scalable to large variant sets and sample sizes.
  • We implement an R function to very quickly simulate variants as follows.
    • Simulate iid \((x_1,\ldots,x_m,x_{m+1})\) from standard normal \(N(0,1)\).
    • Set \(z_j=\sqrt{1-\rho}x_j+\sqrt{\rho}x_{m+1}\), and threshold them at the quantiles.
    • We can easily verify that \(var(z_j)=1, cov(z_j,z_l)=\rho, \forall j\neq l\).
  • We can easily compute LD for any given variant MAFs.
    • Generally for rare variants with small MAFs, the LD is small even for relatively large \(\rho\).
    • For example, for \(\rho=0.5\), and \(\theta_j=0.01\), we compute LD as follows.
      • let \(q_1=\Phi^{-1}(0.01)=-2.326\), compute \(\Pr(z_1\leq q_1, z_2\leq q_1)=0.00129\) (using the mvtnorm R package)
      • so \(corr[I(z_1\leq q_1), I(z_2\leq q_1)] = (0.00129-0.01^2)/(0.01\times 0.99)=0.121\), and hence LD (the squared correlation) \(=0.121^2\approx 0.015\).
    • With MAF=0.01, we need to have \(\rho=0.755\) to reach pairwise LD=0.1.
    • For convenience, we implement an R function to compute the pairwise LD based on the input \(\rho\) and MAFs \((\theta_j,\theta_l)\).
## Simulate variant set
## m number of variants
## rho  compound symmetry correlation
## N population size to be simulated
## maf0,maf1 lower and upper bound for the MAF
## return an Nxm genotype matrix (assuming HWE)
SimVS <- function(m=16, rho=0.5, N=2e5, maf0=0.001,maf1=0.01){
## G sim
  MAFs = runif(m, maf0,maf1)
  q1 = qnorm(MAFs)
  Z1 = matrix(rnorm(N*m),N,m)*sqrt(1-rho) + rnorm(N)*sqrt(rho)
  Z2 = matrix(rnorm(N*m),N,m)*sqrt(1-rho) + rnorm(N)*sqrt(rho)
  G = t( 1*(t(Z1)<q1) + 1*(t(Z2)<q1) )
  return(G)
}
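## quick usage check (illustrative values; empirical MAFs should fall roughly
## within (maf0, maf1) since E(G_j) = 2*theta_j)
Gsim = SimVS(m=16, rho=0.5, N=2e5, maf0=0.001, maf1=0.01)
range(colMeans(Gsim)/2)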
## compute pairwise LD
## rho  compound symmetry correlation
## maf  MAFs of the two variants
## return the computed LD (squared correlation)
library(mvtnorm)
LD <- function(rho=0.5,maf=c(0.01,0.005)){
  q = qnorm(maf)
  sigs = matrix(rho,2,2); diag(sigs) = 1
  pr = pmvnorm(rep(-Inf,2),q, sigma=sigs)
  a1 = (pr-maf[1]*maf[2])/sqrt(maf[1]*(1-maf[1]))/sqrt(maf[2]*(1-maf[2]))
  return(a1^2)
}
## some example
rho = seq(0,0.95, length=20)
LDs = rep(0,20)
for(i in 1:20){
  LDs[i] = LD(rho[i], rep(0.01,2))
}
plot(rho, LDs, type='b', xlab=expression(rho), ylab='LD')
## reverse solving rho for a given LD
uniroot(function(x) LD(x,rep(0.01,2))-0.1, c(0,0.95))$root

Simulation study

On comments 1 and 2

  • Consider 1000 unrelated individuals with three outcomes: two continuous and one binary

    • Two covariates
      • one binary indicator \(X_1\) with event probability of 0.5.
      • one continuous covariate \(X_2\) simulated from the standard normal dist.
    • Traits simulated as follows
      • simulate three continuous traits \((Z_1,Z_2,Z_3)\) from MVN
        • means equal to \(1+0.5(X_1+X_2)\)
        • marginal unit variance and pairwise correlation \(\gamma=0.25\)
      • the last binary trait generated as follows
        • given \(Z_3-1-0.5(X_1+X_2)=\epsilon\sim N(0,1)\), let \(p=\Phi(\epsilon), \eta = \log\frac{p}{1-p}\)
        • define \(D=1\) if \(\eta+0.5(X_1+X_2) >0\) and \(D=0\) otherwise.
        • can easily verify the marginal logistic model: \(\mbox{logit}\Pr(D=1)=0.5(X_1+X_2)\).
  • We simulate a set of 40 variants using the previous approach.
    • MAFs \(\{\theta_j\}\) uniformly simulated from interval \((0.001,0.01)\)
    • MVN correlation \(\rho=0.5\)
    • Variants are equally weighted (i.e., Beta-distribution weights with both parameters equal to 1).
  • Simulation results
    • Conduct \(4\times 10^8\) null simulations to get stable results
    • Consider significance levels \(\alpha=10^{-2},\ldots,10^{-6},10^{-7},5\times 10^{-8}\).
    • Results reported as the inflation ratio, defined as the empirical type I errors divided by the nominal significance level \(\alpha\)
      • value larger than 1 means inflated type I errors.
    • Compare three sets of methods

      1. Original GAMuT implementations with linear and projection kernels.
      2. "Corrected GAMuT": the same GAMuT test statistics, the only difference is the modified p-value calculation by using the numerical algorithm of Wu et al. (2016).
      3. MSKAT statistics Q and Q'.
      \(\alpha\)               10^-2   10^-3   10^-4   10^-5   10^-6   10^-7   5x10^-8
      GAMuT Proj                0.81    0.70    0.60    0.43    0.47    4.69      9.38
      GAMuT Linear              0.81    0.71    0.62    0.48    0.32    3.13      6.26
      GAMuT Proj Corrected      0.81    0.70    0.61    0.52    0.39    0.20      0.15
      GAMuT Linear Corrected    0.81    0.71    0.63    0.56    0.47    0.20      0.20
      MSKAT Q                   0.80    0.69    0.59    0.51    0.39    0.17      0.15
      MSKAT Q'                  0.80    0.70    0.62    0.55    0.46    0.20      0.20
    • Some conclusions
      • GAMuT has severely inflated type I errors for small significance levels of \(10^{-7}\) and \(5\times 10^{-8}\).
      • MSKAT always properly controlled the type I errors.
      • "Corrected GAMuT" has very similar results as MSKAT.
        • linear/projection kernels correspond to Q' and Q respectively.
      • "Corrected GAMuT" vs original GAMuT
        • very similar results at large \(\alpha\) levels (\(10^{-2},10^{-3},10^{-4}\))
        • the original GAMuT has conservative p-values at \(\alpha=10^{-5}\), but severely inflated p-values at \(\alpha=10^{-7},5\times 10^{-8}\).
      • The following simulation results for GAMuT are based on the "Corrected GAMuT".

On comment 3

  • Consider \(n\) unrelated individuals with four outcomes: two continuous and two binary
    • Two covariates
      • one binary indicator \(X_1\) with event probability of 0.5.
      • one positive covariate \(X_2\) simulated from the log-normal distribution.
    • Traits simulated as follows
      • simulate four continuous traits from MVN
        • means equal to \(1+X_1+X_2\)
        • marginal unit variance and pairwise correlation \(\gamma=0.25\)
      • dichotomize the last two traits at their median to produce the binary traits.
    • Simulate a set of 40 variants using the previous approach.
      • MAFs \(\{\theta_j\}\) uniformly simulated from interval \((0.001,0.01)\)
      • MVN correlation \(\rho=0.5\)
  • Simulation results for GAMuT
    • Conduct \(4\times 10^7\) null simulations for \(n=1000,2000,5000\)
    • Consider significance levels \(\alpha=10^{-2},\ldots,10^{-5},2.5\times 10^{-6},10^{-6}\).
    • Results reported as the inflation ratio (empirical type I errors divided by the nominal significance level \(\alpha\))
      • value larger than 1 means inflated type I errors.
    • Results for n=1000

      \(\alpha\)        10^-2   10^-3   10^-4   10^-5   2.5x10^-6   10^-6
      GAMuT Proj         0.95    1.12    1.91    4.76        8.81    14.2
      GAMuT Linear       0.94    1.21    2.31    6.03        11.5    17.9
      MSKAT Q            0.94    1.15    2.13    5.97        12.7    21.6
      MSKAT Q'           0.94    1.25    2.58    7.58        15.7    26.9
    • Results for n=2000

      \(\alpha\)        10^-2   10^-3   10^-4   10^-5   2.5x10^-6   10^-6
      GAMuT Proj         1.02    1.21    1.97    5.01        10.3    16.7
      GAMuT Linear       1.02    1.30    2.32    6.10        12.2    20.1
      MSKAT Q            1.01    1.22    2.05    5.51        11.8    19.5
      MSKAT Q'           1.02    1.31    2.41    6.69        13.6    23.2
    • Results for n=5000

      \(\alpha\)        10^-2   10^-3   10^-4   10^-5   2.5x10^-6   10^-6
      GAMuT Proj         1.04    1.17    1.63    3.39        6.52    10.1
      GAMuT Linear       1.05    1.23    1.83    4.01        7.37    11.3
      MSKAT Q            1.04    1.17    1.64    3.47        6.74    10.5
      MSKAT Q'           1.05    1.23    1.85    4.10        7.69    11.9
    • Conclusions
      • GAMuT only controlled type I errors well at the 0.01 level.
      • GAMuT has severely inflated type I errors at small \(\alpha\) levels.
        • inflation generally got worse at smaller significance levels.
      • For n=1000 and n=2000, we got comparable inflations.
      • With large sample size n=5000, type I errors were still severely inflated (though reduced).
        • very large sample sizes might be needed for the CLT to kick in and produce accurate p-values.
        • worthwhile to check/verify small p-values by resampling/permutations.
      • Inflation is mainly due to the two binary traits (non-constant variances).

On comment 4

  • Consider 1000 unrelated individuals with four continuous outcomes
    • Two covariates
      • one binary indicator \(X_1\) with event probability of 0.5.
      • one continuous covariate \(X_2\) simulated from the standard normal dist.
    • Traits simulated as four continuous traits from MVN
      • marginal unit variance and pairwise correlation \(\gamma=0.25\)
      • means equal to \(1+X_1+X_2+\sum_{j\in A}\beta_jg_j\)
        • \(g_j\) is the genotype for variant \(j\), \(\beta_j\) is the effect size, and \(A\) is the set of causal variants.
        • causal variant set A could be different across traits.
    • Simulate a set of \(m\) variants using the previous approach.
      • MVN correlation \(\rho=0.5\)
  • Power simulation 1
    • set MAFs as follows
      • simulate \((\theta_1,\ldots,\theta_m)\) uniformly from interval \((0.001,0.005)\).
      • set MAF as \(\theta_j+0.01X_1\) for variant j.
      • here we have population stratification.
      • variants equally weighted.
    • Conduct \(10^5\) simulations for \(m=20,40,80\).
    • For each trait, we randomly select 4 variants as causal with effect size \(\beta_j=0.5\).
    • Consider significance levels \(\alpha=10^{-5},2.5\times 10^{-6},10^{-6},10^{-7}\).
    • Results for \(m=80\)

      \(\alpha\)        10^-5   2.5x10^-6   10^-6   10^-7
      Q                 0.446       0.345   0.286   0.168
      GAMuT Proj        0.406       0.306   0.250   0.141
      Q'                0.544       0.438   0.375   0.237
      GAMuT Linear      0.509       0.403   0.339   0.206
    • Results for \(m=40\)

      \(\alpha\)        10^-5   2.5x10^-6   10^-6   10^-7
      Q                 0.590       0.485   0.419   0.273
      GAMuT Proj        0.556       0.448   0.382   0.239
      Q'                0.649       0.546   0.479   0.325
      GAMuT Linear      0.618       0.512   0.444   0.291
    • Results for \(m=20\)

      \(\alpha\)        10^-5   2.5x10^-6   10^-6   10^-7
      Q                 0.755       0.668   0.607   0.454
      GAMuT Proj        0.734       0.641   0.577   0.420
      Q'                0.780       0.693   0.633   0.474
      GAMuT Linear      0.760       0.668   0.605   0.442
  • Power simulation 2
    • set MAFs as follows
      • simulate \((\theta_1,\ldots,\theta_m)\) uniformly from interval \((0.001,0.002)\).
      • set MAF as \(\theta_j+0.02X_1\) for variant j.
      • here we have stronger population stratification.
      • variants equally weighted.
    • Conduct \(10^5\) simulations for \(m=20,40,80\).
    • For each trait, we randomly select 4 variants as causal with effect size \(\beta_j=0.4\).
    • Consider significance levels \(\alpha=10^{-5},2.5\times 10^{-6},10^{-6},10^{-7}\).
    • Results for \(m=80\)

      \(\alpha\)        10^-5   2.5x10^-6   10^-6   10^-7
      Q                 0.457       0.354   0.293   0.172
      GAMuT Proj        0.359       0.261   0.206   0.105
      Q'                0.596       0.491   0.423   0.268
      GAMuT Linear      0.507       0.397   0.329   0.193
    • Results for \(m=40\)

      \(\alpha\)        10^-5   2.5x10^-6   10^-6   10^-7
      Q                 0.574       0.462   0.393   0.252
      GAMuT Proj        0.475       0.359   0.300   0.174
      Q'                0.678       0.573   0.504   0.343
      GAMuT Linear      0.604       0.484   0.410   0.256
    • Results for \(m=20\)

      \(\alpha\)        10^-5   2.5x10^-6   10^-6   10^-7
      Q                 0.732       0.635   0.573   0.409
      GAMuT Proj        0.657       0.556   0.485   0.320
      Q'                0.799       0.707   0.642   0.485
      GAMuT Linear      0.740       0.635   0.566   0.397
    • Conclusions
      • MSKAT generally has larger power than GAMuT.
        • more so under stronger population differences and for larger variant sets (in terms of percentage increase).
      • This is because GAMuT uses the unadjusted genotype covariance, while MSKAT properly adjusts for covariate effects and uses the adjusted genotype covariance matrix.

Simulation R codes

On comments 1 and 2: GAMuT vs MSKAT

  • Following the approach of MSKAT (Wu and Pankow, 2016), which uses a score test based on multiple linear regression modeling, we developed a very fast implementation of GAMuT called fgamut(). It reproduces the p-values of GAMuT with the linear and projection kernels and also outputs the corresponding corrected p-values using the approach of Wu et al. (2016).
  • We verify that fgamut() reproduces the results from gamut() (adapted from the authors' posted R code).
    • Occasionally we see very small differences, due to numerical rounding differences between eigen() and svd() when computing eigenvalues.
    • Generally fgamut() is several orders of magnitude faster than gamut().
      • especially so for large sample size and variant set.
    • gamut() occasionally gives aberrant results for very small p-values; fgamut() produces much more accurate results in that range.
    • MSKAT generally produces smaller p-values than (corrected) GAMuT under population stratification.
library(CompQuadForm)
library(MSKAT)
n=5e3; K=4; p = 4; m = 100
X = matrix(rnorm(n*p), n,p); Z = rbinom(n,1,0.5); XZ = cbind(X,Z)
## simulate genotype from MVN
rho = 0.5; maf1 = 0.005; maf2 = 0.015
q1 = qnorm(maf1); q2 = qnorm(maf2)
qz = q1 + Z*(q2-q1)
G1 = matrix(rnorm(n*m), n,m)*sqrt(1-rho) + rnorm(n)*sqrt(rho)
G2 = matrix(rnorm(n*m), n,m)*sqrt(1-rho) + rnorm(n)*sqrt(rho)
G = 1*(G1<qz) + 1*(G2<qz)
yrho = 0.25
Y = matrix(rnorm(n*K), n,K)*sqrt(1-yrho) + rnorm(n)*sqrt(yrho)
Y[,1] = Y[,1] + rowSums(XZ) - rowSums(G[,1:3])*0.5 + G[,m]*0.7
Y[,2] = Y[,2] + rowSums(XZ) + rowSums(G[,c(1,4,5)])*0.5 - G[,m-1]*0.6
## weighted test
MSKAT(MSKAT.cnull(Y,XZ), G, W.beta=c(1,25))
fgamut(Y,G,XZ, W.beta=c(1,25))
gamut(Y,G,XZ, W.beta=c(1,25))
## unweighted test
MSKAT(MSKAT.cnull(Y,XZ), G, W.beta=c(1,1))
fgamut(Y,G,XZ, W.beta=c(1,1))
gamut(Y,G,XZ, W.beta=c(1,1))
## timing
system.time( gamut(Y,G,XZ, W.beta=c(1,25)) )
system.time( fgamut(Y,G,XZ, W.beta=c(1,25)) )
  • Simulation function to verify type I errors
library(CompQuadForm)
library(MSKAT)
SimSize2 = function(B=1e3, n=1e3, yrho=0.25, grho=0.5, m=40, MAF0=0.001,MAF1=0.01){
  Pval = matrix(1, B,6)
  for(ib in 1:B){
    Xd = rnorm(n); Xg = rbinom(n,1,0.5); Xs = cbind(Xd,Xg)
    maf.1 = runif(m, MAF0,MAF1)
    qz = qnorm(maf.1)
    ##
    rho1 =sqrt(1-grho);  rho2 = sqrt(grho)
    Z1 = matrix(rnorm(n*m), n,m)*rho1 + rnorm(n)*rho2
    Z2 = matrix(rnorm(n*m), n,m)*rho1 + rnorm(n)*rho2
    G = 1*(Z1<qz) + 1*(Z2<qz)
    ##
    eta = 1+0.5*(Xd+Xg)
    rho1 = sqrt(1-yrho);  rho2 = sqrt(yrho)
    tmp = matrix(rnorm(n*3),n,3)*rho1+rnorm(n)*rho2
    Ys = tmp[,1:2] + eta
    pr = pnorm(tmp[,3])
    Yu = eta-1 + log(pr/(1-pr))
    D = I(Yu>0)
    ##
    Y = cbind(Ys,D)
    pval1 = MSKAT(MSKAT.cnull(Y,Xs), G, W.beta=c(1,1))
    pval2 = fgamut(Y,G,Xs, W.beta=c(1,1))
    Pval[ib,] = c(pval1, pval2)
  }
  return(Pval)
}
res = SimSize2(1e3, n=1e3,yrho=0.25, grho=0.5, m=40)
colMeans(res<1e-2)

On comment 3

  • Type I error with mixed traits.
MixSize = function(B=1e2, n=1e3, yrho=0.25, grho=0.5, m=40, MAF0=0.001,MAF1=0.01){
  ## sim data
  Pval = matrix(1, B,6)
  for(ib in 1:B){
    Xd = exp(rnorm(n))
    Xg = rbinom(n,1,0.5); Xs = cbind(Xd,Xg)
    maf.1 = runif(m, MAF0,MAF1)
    qz = qnorm(maf.1)
    ##
    rho1 =sqrt(1-grho);  rho2 = sqrt(grho)
    Z1 = matrix(rnorm(n*m), n,m)*rho1 + rnorm(n)*rho2
    Z2 = matrix(rnorm(n*m), n,m)*rho1 + rnorm(n)*rho2
    G = 1*(Z1<qz) + 1*(Z2<qz)
    ## outcome sim
    eta = 1+Xd+Xg
    rho1 = sqrt(1-yrho);  rho2 = sqrt(yrho)
    tmp = matrix(rnorm(n*4),n,4)*rho1+rnorm(n)*rho2
    Yc = tmp[,1:2] + eta
    Yd = tmp[,3:4] + eta
    D = apply(Yd, 2, function(a) 1*(a>median(a)))
    ##
    Y = cbind(Yc,D)
    pval1 = MSKAT(MSKAT.cnull(Y,Xs), G, W.beta=c(1,1))
    pval2 = fgamut(Y,G,Xs, W.beta=c(1,1))
    Pval[ib,] = c(pval1, pval2)
  }
  return(Pval)
}
res1 = MixSize(1e3, n=1e3, yrho=0.25, grho=0.5, m=40, MAF0=0.001,MAF1=0.01)
colMeans(res1<1e-2)
res2 = MixSize(1e3, n=2e3, yrho=0.25, grho=0.5, m=40, MAF0=0.001,MAF1=0.01)
colMeans(res2<1e-2)
res5 = MixSize(1e3, n=5e3, yrho=0.25, grho=0.5, m=40, MAF0=0.001,MAF1=0.01)
colMeans(res5<1e-2)

On comment 4

  • Power comparison under population stratification.
library(CompQuadForm)
library(MSKAT)
PwrSim = function(B=1e2, n=1e3, yrho=0.25, grho=0.5, m=50, MAF0=0.001,MAF1=0.005, gps=0.01){
  ## sim data
  Pval = matrix(1, B,6)
  for(ib in 1:B){
    Xd = rnorm(n)
    Xg = rbinom(n,1,0.5); Xs = cbind(Xd,Xg)
    maf.1 = runif(m, MAF0,MAF1)
    maf.2 = outer(Xg*gps, maf.1, '+')
    qz = qnorm(maf.2)
    ##
    rho1 =sqrt(1-grho);  rho2 = sqrt(grho)
    Z1 = matrix(rnorm(n*m), n,m)*rho1 + rnorm(n)*rho2
    Z2 = matrix(rnorm(n*m), n,m)*rho1 + rnorm(n)*rho2
    G = 1*(Z1<qz) + 1*(Z2<qz)
    ## outcome sim
    eta = 1+Xd+Xg
    rho1 = sqrt(1-yrho);  rho2 = sqrt(yrho)
    tmp = matrix(rnorm(n*4),n,4)*rho1+rnorm(n)*rho2
    Y = tmp + eta
    Y[,1] = Y[,1] + rowSums(G[,sample(1:m, size=4, rep=FALSE)])*0.5
    Y[,2] = Y[,2] + rowSums(G[,sample(1:m, size=4, rep=FALSE)])*0.5
    Y[,3] = Y[,3] + rowSums(G[,sample(1:m, size=4, rep=FALSE)])*0.5
    Y[,4] = Y[,4] + rowSums(G[,sample(1:m, size=4, rep=FALSE)])*0.5
    ##
    pval1 = MSKAT(MSKAT.cnull(Y,Xs), G, W.beta=c(1,1))
    pval2 = fgamut(Y,G,Xs, W.beta=c(1,1))
    Pval[ib,] = c(pval1, pval2)
  }
  return(Pval)
}

res = PwrSim(1e2, n=1e3, yrho=0.25, grho=0.5, m=40, MAF0=0.001,MAF1=0.005, gps=0.01)
colMeans(res<1e-6)