Random Permutations and Permutation Tests


A.  Generating Random Permutations Efficiently
----------------------------------------------

Given an array (x1, x2, x3, ..., xN), it is possible to create a random permutation
by (1) writing out a list of n random numbers, (r1, r2, r3, ..., rN), (2) sorting
the list in ascending order, and carrying along the x's in parallel.

This is not an efficient way to create random permutations, because the process
of sorting a list of numbers uses a lot of computer time.  Some sorting algorithms
are more efficient than others; the bubble sort is rather inefficient, requiring
N * (N - 1) comparisons and moves.

It is possible to create a random permutation with almost no comparisons, and
with the number of moves being of order N.

Here is how this is done.

Assume you have two rows of boxes.  In the first row of boxes there are the
N numbers 1, 2, 3, ... N in order.

The second row of boxes is empty.

Begin by randomly choosing one number from the Row 1 and putting it in
the first box in Row 2.  Say that number came from the i-th box in Row 1.

Then take the number which is in the last position in Row 1.  Put
it into the i-th (now empty) box in Row 1.  Row 1 now has the first
N - 1 boxes nonempty and the remaining box is empty.

For the next step, choose a box randomly from the first N - 1 nonempty
boxes in Row 1.  Put the contents into the SECOND position in Row 2.
Again move the contents of the (N - 1)th box in Row 1 into the box
that was just vacated.

Keep doing this.  When you have finished, Row 2 is a random permutation
of Row 1.

This illustrates a general principle: usually it is easier to scramble 
something than it is to unscramble it.

B.  Permutation Tests
---------------------
Suppose you have conducted a clinical trial.  This is a simple
two-group design (Drug A vs. Drug B) with a dichotomous outcome:
e.g. X = 0 indicates failure, X = 1 indicates success.  The following
table summarizes the design:


           A       B
       |-------|-------|
       |       |       |
 X = 0 |   a   |   b   |   n1
       |       |       |
       |-------|-------|
       |       |       |
 X = 1 |   c   |   d   |   n2
       |       |       |
       |-------|-------|
           m1      m2

The usual tests for significance are Fisher's Exact test,
a chi-square test, or a Yates-corrected chi-square test.

There is however another way to produce a test statistic.
In the table above, m1 people were randomized to group A
and m2 people were randomized to group B.  This resulted
in the table entries that are shown: a, b, c, and d.

Now suppose after the fact you re-randomized the m1 + m2 people in
such a way that m1 of them were assigned to group A and m2
were assigned to group B.  The resulting 2 x 2 table has
entries a', b', c' and d', which in general are not equal to
the original entries.

One measure of the treatment effect in such a table is
the comparison of proportions of people in each group who
have X = 1, that is, a comparison of these proportions within
each group.  In the original table, these proportions are

     pA = c / m1  and  pB = d / m2.

The observed difference of these proportions is  d = pA - pB.

The question you want to answer is: how likely is it that, just
by chance - that is, assuming there is no true treatment effect -
that you would observe a difference as large as d, or larger?

So assume you carry out 10,000 "pseudo-randomizations".  That is,
you randomly assign m1 of the people to treatment A and m2 to
treatment B.  For each one of these pseudo-randomizations, you
get table entries a', b', c', and d'.  The corresponding proportions
are pA' = c' / m1 and pB' = d' / m2.  The corresponding difference of
proportions is d' = pA' - pB'.  You do this computation for each
of the 10,000 pseudo-randomizations.

[Note that you do NOT change the outcomes for the people in the
 study.  You are just randomly re-assigning them to a different
 pseudo-treatment-group.  Effectively you are just putting a different
 treatment label on then.]

Then you count how many times you find that d' is bigger than or
equal to d.  Say that happens in 320 cases out of the 10,000 re-
randomizations.  Then let r = 320 / 10,000 = .032.  This is an
estimate of the probability that, if there is no true treatment
effect, you would end up with an estimated effect which was bigger
than or equal to the effect d = pA - pB that was actually
observed.  So by this means you get an estimated p-value, .032.

Note that this is a one-sided test.  However it is easy to modify
this to get a two-sided test.  You would need to count how many
times out of 10,000 you find that d' <= -d and add that to the number of times
that you found d' >= d.  Say the resulting total is 606.  Then the
resulting estimate of the p-value is 0.0606.

This is called a permutation test.  This is a special case.  You
can also carry out permutation tests instead of using other test
statistics.  A great virtue of permutation tests is that they do not
depend on knowing underlying distributions.  In that sense they may
be preferable to using other tests in small-sample size situations where
you are uncertain what about what the variance structure of the
outcome variable might be.  Permutation tests are essentially non-
parametric tests which take advantage of high-speed computing.

Problem
-------
a) Suppose you are conducting a two-group randomized trial with
a quantitative outcome.  In many such trials, the test statistic
would be a two-sample t-test.  Write a macro to carry out a permutation
test instead of the t-test.  The form of the macro should look like
the following:

%macro permttest(datafile, treatment_group, outcome, nperm, nsides, permpvalue) ;

Here datafile is a file of the following form:

     Obs     treatment_group    outcome
    -----    ---------------    -------
      1             A             x1
      2             B             x2
      3             B             x3
      4             A             x4
      .             .              .
      .             .              .
      .             .              .
     96             B             x96
     97             B             x97
     98             A             x98
     99             A             x99
    100             A             x100

*** nperm indicates the number of permutations.
*** nsides = 1 or 2 depending on whether you want a one-sided
    test or a 2-sided one.
*** permpvalue = the estimated p-value based on the nperm pseudo-randomizations.

b) Find a real dataset for a two-group clinical trial with a
   quantitative outcome (e.g, weight change or blood pressure)
   and apply your permutation test to it.  Compare the result
   with what you would get from PROC NPAR1WAY and PROC TTEST.

Web address: http://www.biostat.umn.edu/~john-c/random.permutations

Last modified: September 29, 2012.