PROC FREQ:  Cross-Classified Data, Mantel-Haenszel, etc.    n54703.003

     This section of notes will correspond to Chapter 3 of the Der-Everitt
textbook.

     Cross-classified data usually represent counts in categories where the
categories are defined by one or more variables which take on a small number
of values.  The usual question one is asking with cross-classified data is
whether the counts in various cells are indicative of some kind of
relationship between the classifying variables.

     Cross-classified data are often called contingency tables.

     Suppose for example you randomly sample 100 men and 100 women from the
general population.  You ask each person:

     Do you enjoy watching skating on TV ? (yes or no)

     You obtain results as follows:


                      Men         Women
                  ----------------------
    Yes, Enjoy    |          |         |
    Skating       |    41    |    56   |   97
                  |          |         |
                  ----------------------
    No, Don't     |          |         |
    Enjoy Skating |    59    |    44   |  103
                  |          |         |
                  --------------------------
                      100        100   |  200


     It appears here that more women than men enjoy watching skating on TV.
[These are not real data, however.]

     The relevant tests here for whether gender is related to enjoyment of
skating are Fisher's exact test, the chi-square test, and the
continuity-corrected chi-square test.  Here is how to carry out these tests
in SAS:

========================================================================

options  linesize = 80 ;

footnote "~john-c/5421/n54703.003.sas &sysdate &systime" ;

data skating ;
     length gender $3  enjskate $4 ;
     input gender enjskate count ;
cards ;
   0-M   1-Yes  41
   1-F   1-Yes  56
   0-M   2-No   59
   1-F   2-No   44
;
run ;

proc freq data = skating ;
     weight count ;                      /*  Note weight statement ...*/
     tables enjskate * gender / chisq ;  /*  Note chisq option .......*/
title1 'proc freq analysis of 2 x 2 table on skating ...' ;
run ;

-----------------------------------------------------------------------

                proc freq analysis of 2 x 2 table on skating ...               1
                                                  17:21 Sunday, January 25, 2004

                          TABLE OF ENJSKATE BY GENDER

                      ENJSKATE     GENDER

                      Frequency|
                      Percent  |
                      Row Pct  |
                      Col Pct  |0-M     |1-F     |  Total
                      ---------+--------+--------+
                      1-Ye     |     41 |     56 |     97
                               |  20.50 |  28.00 |  48.50
                               |  42.27 |  57.73 |
                               |  41.00 |  56.00 |
                      ---------+--------+--------+
                      2-No     |     59 |     44 |    103
                               |  29.50 |  22.00 |  51.50
                               |  57.28 |  42.72 |
                               |  59.00 |  44.00 |
                      ---------+--------+--------+
                      Total         100      100      200
                                  50.00    50.00   100.00


                   STATISTICS FOR TABLE OF ENJSKATE BY GENDER

             Statistic                     DF     Value        Prob
             ------------------------------------------------------
             Chi-Square                     1     4.504       0.034
             Likelihood Ratio Chi-Square    1     4.521       0.033
             Continuity Adj. Chi-Square     1     3.924       0.048
             Mantel-Haenszel Chi-Square     1     4.482       0.034
             Fisher's Exact Test (Left)                       0.024
                                 (Right)                      0.988
                                 (2-Tail)                     0.047
             Phi Coefficient                     -0.150            
             Contingency Coefficient              0.148            
             Cramer's V                          -0.150            

             Sample Size = 200

 
                   ~john-c/5421/n54703.003.sas 25JAN04 17:21
========================================================================

Notes on Program:

     First, note that the variables 'gender' and 'enjskate' are both
alphanumeric variables rather than numeric.  They are specified as having
lengths 3 and 4 ($3 and $4) respectively.

     Note that in the "cards" datafile, I have coded men as "0-M" and women
as "1-F".  The reason I put 0 and 1 in front of the letters is that proc
freq lists the values of each classifying variable in sorted (alphabetical)
order.  I wanted the order in the output table to be the same as the order
in the input "cards".  Because 'F' comes before 'M' alphabetically, if I had
coded them simply as M and F, they would have been printed on the output
table in the reverse order.
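
     An alternative to prefixing the values with digits is the ORDER=DATA
option on the proc freq statement, which lists values in the order in which
they first appear in the input dataset.  The following is only a sketch: the
dataset name 'skating2' and the plain M/F and Yes/No codings are made up for
this illustration and are not part of the program above.

```sas
data skating2 ;                      /* hypothetical dataset name */
     length gender $1  enjskate $3 ;
     input gender enjskate count ;
cards ;
M   Yes  41
F   Yes  56
M   No   59
F   No   44
;
run ;

proc freq data = skating2  order = data ;   /* keep input order: M, F */
     weight count ;
     tables enjskate * gender / chisq ;
run ;
```

     With ORDER=DATA the columns should appear as M then F, and the rows as
Yes then No, because that is the order in which the values are first read.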

     Note also that there is an error in the 'length' statement: the length
for 'enjskate' should have been $5, not $4.  This does not cause the data to
be classified incorrectly in this case, but as you can see, in the printed
table, one of the values of enjskate is printed as '1-Ye' rather than
'1-Yes'.

     Note also the "count" variable.  The 4 input cards define the 4 cells
of the table, and "count" is the number of people in each cell.  Proc freq
must be told to use this variable.  Note that in the proc freq section,
there is the line

     weight count ;

     This tells SAS that it should give a weight of the value of 'count' to
each cell.  If the 'weight count' statement is omitted, SAS assumes that
each cell has 1 observation in it.  This gives completely misleading and
useless statistics (try it and see!).
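
     To "try it and see", the same proc freq step can be run with the weight
statement removed.  This sketch assumes the 'skating' dataset created above.

```sas
proc freq data = skating ;
     /* no weight statement: each of the 4 data lines counts once */
     tables enjskate * gender / chisq ;
run ;
```

     Each cell of the resulting table has frequency 1 and the sample size is
reported as 4 rather than 200, so the test statistics say nothing about the
survey.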

     The input data could have been structured differently, with one
observation for each of the 200 people (200 data lines).  If that had been
done, the 'count' variable would not be needed.  Obviously the 'cards'
section would have been a lot longer.  In general, it saves space to use
counts and the 'weight' statement in analyzing cross-classified data.
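
     The one-line-per-person layout can be generated from the count layout
rather than typed in.  The following sketch assumes the 'skating' dataset
created above; the name 'skating_long' and the loop variable 'i' are made up
for this illustration.

```sas
data skating_long ;                  /* hypothetical dataset name */
     set skating ;
     do i = 1 to count ;             /* write one record per person */
          output ;
     end ;
     drop i count ;
run ;

proc freq data = skating_long ;      /* 200 records, so no weight needed */
     tables enjskate * gender / chisq ;
run ;
```

     The table and test statistics from this step should be the same as
those obtained from the 4-line dataset with the weight statement.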

     Proc freq works equally well with numeric or alphanumeric variables.


Notes on Printout

     Proc freq prints several statistics for testing the null hypothesis
that there is no relationship between the two classifying variables: the
chi-square test, the likelihood ratio test, the continuity-adjusted
chi-square (also known as Yates' correction), and the Fisher Exact test.
The latter two tests are *conservative*, meaning that as a rule they are
less likely to reject the null hypothesis than the unadjusted chi-square
test or the likelihood ratio chi-square test.  In this case the p-values
for the first two tests are 0.034 and 0.033, while the p-values for the
adjusted chi-square test and the two-sided Fisher Exact test are 0.048 and
0.047.

     Which test should you use?

     In general, the Fisher Exact Test is the most widely accepted. This is
partly because of its conservatism, partly because it is not based on an
asymptotic distribution (unlike the chi-square tests) and partly because it
has the word "Exact" in its title.  In fact the Fisher Exact Test assumes an
underlying hypergeometric distribution, although in most contingency tables,
the sampling distribution is not hypergeometric.  There is debate among
statisticians about which test is best.  The bottom line is, again, the
two-sided Fisher Exact Test is the most widely accepted and the least likely
to be questioned if the reported result is statistically significant.

     Note that if you do not include the 'chisq' option on the 'tables'
statement, you will not get any test statistics.

     The printout also shows a Mantel-Haenszel chi-square.  This is not
usually employed in the case of single 2 x 2 tables.  See below for more
discussion of the Mantel-Haenszel test. Also shown are the Phi coefficient,
the Contingency coefficient, and Cramer's V.  These are all aimed at
quantifying the relationship between the classifying variables.  They are
not often quoted in publications.  See SAS manuals for their definitions.

     Two statistics that are not shown in the printout are the odds ratio
(OR) and the relative risk (RR).  These are defined as follows:  Given a 2 x
2 contingency table like:

                      Men         Women
                  ----------------------
    Yes, Enjoy    |          |         |
    Skating       |     a    |     b   |  m1
                  |          |         |
                  ----------------------
    No, Don't     |          |         |
    Enjoy Skating |     c    |     d   |  m2
                  |          |         |
                  --------------------------
                       n1         n2   |  N

the *estimated risk* (of enjoying skating) among men and women is

      Risk(men)   = a / n1 = a / (a + c),  and

      Risk(women) = b / n2 = b / (b + d).

     The *relative risk* for men vs. women is defined as:

      RelRisk = RR = Risk(men)/Risk(women)

              = a * (b + d) / (b * (a + c)).

     Similarly, the estimated *odds* (of enjoying skating) for men and women
are:

     Odds(men)   = a / c, and

     Odds(women) = b / d.

     The estimated *odds ratio* for men vs. women is:

     Odds ratio = OR = Odds(men)/Odds(women) = a * d / (b * c).


     In the case of the data displayed above, we have:

     Risk(men)   = 41/100 = .41

     Risk(women) = 56/100 = .56

     RelRisk     = .41 / .56 = .732.


     Odds(men)   = 41/59 = .6949,

     Odds(women) = 56/44 = 1.2727, and

     OddsRatio   = .6949/1.2727 = .546.
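
     The same arithmetic can be checked in a short data step.  This is just
a sketch; the variable names 'relrisk' and 'oddsrat' are made up for the
illustration, and a, b, c, d are the cell counts in the layout shown above.

```sas
data _null_ ;
     a = 41 ;  b = 56 ;              /* Yes: men, women */
     c = 59 ;  d = 44 ;              /* No : men, women */
     relrisk = (a / (a + c)) / (b / (b + d)) ;   /* Risk(men)/Risk(women) */
     oddsrat = (a / c) / (b / d) ;               /* Odds(men)/Odds(women) */
     put relrisk= 6.3  oddsrat= 6.3 ;            /* written to the SAS log */
run ;
```

     The values written to the log should agree with the hand computations
above (.732 and .546).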

    If there is no relationship between the classifying variables (in this
case, between gender and enjoyment of skating), then the true (population)
values of the relative risk and the odds ratio are both 1.00.  Thus
deviations of the estimates away from 1.00 indicate how strong the
relationship between the row and column variables is.  Whenever the two are
not equal to 1.00, the odds ratio is farther away from 1.00 than is the
relative risk.  The more natural statistic to think about is the relative
risk.  Both the relative risk and the odds ratio may be applied in cohort
studies, but only the odds ratio is meaningful in case-control studies.

     SAS will print the odds ratio if you use the MEASURES option on the
TABLES statement; 95% confidence intervals for the OR are printed also.
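
     If only the odds ratio and relative risks are wanted, the RELRISK
option on the TABLES statement is a shorter alternative to MEASURES.  The
sketch below assumes the 'skating' dataset created above.  Because SAS
computes these quantities as Row1/Row2, gender is placed in the rows here so
that the comparison is men vs. women.

```sas
proc freq data = skating ;
     weight count ;
     /* rows = gender, columns = enjskate, so Col1 Risk = risk of "Yes" */
     tables gender * enjskate / relrisk ;
run ;
```

     With this layout the "Cohort (Col1 Risk)" line should correspond to the
relative risk computed by hand above (.732) and the "Case-Control (Odds
Ratio)" line to the odds ratio (.546).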

------------------------------------------------------------------------

     Larger Contingency Tables

     Contingency tables with more than 2 rows and/or columns occur in many
health-related studies.  Here is a 3 x 4 example:


            Afro-      Asian-     Euro-      Native
            Amer.      Amer.      Amer.       Amer.
         ---------------------------------------------
         |          |          |          |          |
   Cat   |    32    |    45    |    30    |    20    |
         |          |          |          |          |
         ---------------------------------------------
         |          |          |          |          |
   Dog   |    48    |    30    |    50    |    70    |
         |          |          |          |          |
         ---------------------------------------------
         |          |          |          |          |
   Fish  |    20    |    25    |    20    |    10    |
         |          |          |          |          |
         ---------------------------------------------
             100        100        100        100


     The column-labels refer to ethnic groups.  The row-labels refer to
preferences for pets [I.e., which pet would you prefer? Cat, Dog, or Fish ?]

     The null hypothesis to be tested is that the proportions of preferences
are the same in each ethnic group.  To put it another way, if the null
hypothesis held exactly, the counts in each column would be a multiple of
the counts in any of the other columns.

     Here is how these data would be analyzed in SAS:

========================================================================

options  linesize = 80 ;
footnote "~john-c/5421/n54703.003.sas &sysdate &systime" ;

data d3x4 ;
     length ethnic $6 petpref $3 ;  /* Note: $3 truncates 'fish' to 'fis' */
                                    /* in the printout; $4 avoids this.   */
     input petpref ethnic count ;
cards ;
cat   afro   32
cat   asian  45
cat   nativ  20
cat   euro   30
dog   afro   48
dog   asian  30
dog   nativ  70
dog   euro   50
fish  afro   20
fish  asian  25
fish  nativ  10
fish  euro   20
;

run ;

proc freq data = d3x4 ;
     tables petpref * ethnic / chisq ;
     weight count ;
title1 'Proc freq for a 3 x 4 table ... ' ;
run ;

------------------------------------------------------------------------

                        Proc freq for a 3 x 4 table ...                        1
                                                  18:47 Monday, January 26, 2004

                               The FREQ Procedure

                           Table of petpref by ethnic

             petpref     ethnic

             Frequency|
             Percent  |
             Row Pct  |
             Col Pct  |afro    |asian   |euro    |nativ   |  Total
             ---------+--------+--------+--------+--------+
             cat      |     32 |     45 |     30 |     20 |    127
                      |   8.00 |  11.25 |   7.50 |   5.00 |  31.75
                      |  25.20 |  35.43 |  23.62 |  15.75 |
                      |  32.00 |  45.00 |  30.00 |  20.00 |
             ---------+--------+--------+--------+--------+
             dog      |     48 |     30 |     50 |     70 |    198
                      |  12.00 |   7.50 |  12.50 |  17.50 |  49.50
                      |  24.24 |  15.15 |  25.25 |  35.35 |
                      |  48.00 |  30.00 |  50.00 |  70.00 |
             ---------+--------+--------+--------+--------+
             fis      |     20 |     25 |     20 |     10 |     75
                      |   5.00 |   6.25 |   5.00 |   2.50 |  18.75
                      |  26.67 |  33.33 |  26.67 |  13.33 |
                      |  20.00 |  25.00 |  20.00 |  10.00 |
             ---------+--------+--------+--------+--------+
             Total         100      100      100      100      400
                         25.00    25.00    25.00    25.00   100.00


                   Statistics for Table of petpref by ethnic

             Statistic                     DF       Value      Prob
             ------------------------------------------------------
             Chi-Square                     6     32.5319    <.0001
             Likelihood Ratio Chi-Square    6     33.4957    <.0001
             Mantel-Haenszel Chi-Square     1      0.2616    0.6090
             Phi Coefficient                       0.2852          
             Contingency Coefficient               0.2742          
             Cramer's V                            0.2017          
 
                   ~john-c/5421/n54703.003.sas 26JAN04 18:47

========================================================================
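
     To see where the chi-square statistic comes from, it can help to print
the counts expected under the null hypothesis next to the observed counts.
The following sketch assumes the 'd3x4' dataset created above and uses the
EXPECTED and CELLCHI2 options of the TABLES statement; the NOROW, NOCOL and
NOPERCENT options only suppress the percentages to keep the table short.

```sas
proc freq data = d3x4 ;
     weight count ;
     tables petpref * ethnic / chisq expected cellchi2
                               norow nocol nopercent ;
run ;
```

     Each expected count is (row total) x (column total) / N.  For the
cat/afro cell, for example, this is 127 x 100 / 400 = 31.75.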

Multiple 2 x 2 Tables :  Mantel-Haenszel Statistics

     Assume you want to study the relationship of categorical variable B
(predictor or risk factor) to categorical variable A (outcome). However,
your dataset also contains information on another categorical variable, C,
which also can affect the outcome A.

     You want to estimate the effect of B "controlling for" the variable C.  
To do this, you need to stratify the data according to the levels of C.

     For example, assume A = occurrence of heart disease, B = exposure to
smoking, and C = gender.  You know that men are more likely to have heart
disease than women, so, in estimating the effect of smoking on heart
disease, you need to control for gender.  The data may be displayed as
follows:

                          Men                      Women
                 ---------------------     ---------------------
                    Smoke    No Smoke         Smoke   No Smoke
                 ---------------------     ---------------------
                 |         |         |     |         |         |
    Heart Dis +  |    24   |    18   |     |    15   |    10   |
                 |         |         |     |         |         |
                 ---------------------     ---------------------
                 |         |         |     |         |         |
    Heart Dis -  |    76   |    82   |     |    85   |    90   |
                 |         |         |     |         |         |
                 ---------------------     ---------------------
                     100       100             100       100

                     OR = 1.439                OR = 1.588


     To combine the data from these two strata, an underlying condition
should be met.  The underlying condition is that the *relative effect* of
smoking is the same in both genders. This is usually interpreted as saying
that the odds ratio for smokers vs. nonsmokers is the same for the two
genders.  Another way of saying this is that there is *homogeneity* of the
odds ratio between the two genders.

     The Mantel-Haenszel analysis is based on computing a kind of average of
the odds ratios for the separate strata.
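
     The "kind of average" can be made concrete.  Writing a, b for the first
row and c, d for the second row of each stratum's 2 x 2 table, and n for the
stratum total, the usual Mantel-Haenszel estimate of the common odds ratio
is

          OR(MH) = sum(a * d / n) / sum(b * c / n),

with the sums taken over strata.  The data step below is a sketch that
applies this to the two tables shown above; the variable names are made up
for the illustration.

```sas
data _null_ ;
     /* men  : a=24  b=18  c=76  d=82   n=200 */
     /* women: a=15  b=10  c=85  d=90   n=200 */
     num   = (24 * 82) / 200  +  (15 * 90) / 200 ;
     den   = (18 * 76) / 200  +  (10 * 85) / 200 ;
     or_mh = num / den ;
     put or_mh= 7.4 ;                /* written to the SAS log */
run ;
```

     The result, 3318/2218, or about 1.496, lies between the two
stratum-specific odds ratios of 1.439 and 1.588.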

     There is a test for homogeneity: the Breslow-Day test.  If the
assumption of homogeneity is rejected, a combined analysis should not be
reported.  In that case the strata should be analyzed separately.  Separate
odds ratios for each stratum should be reported.

     In the example above, there are two strata.  However in general the
number of strata may be larger.

     To carry out a Mantel-Haenszel analysis, you need to use a tables
statement of the form

            tables  C * A * B / chisq  cmh  measures ;

where C is the stratifying variable, A is the outcome variable, and B is the
exposure variable of interest.  The 'cmh' option stands for
'Cochran-Mantel-Haenszel'.  The 'measures' option causes the printing of the
odds ratio for each individual 2 x 2 table.  The following example shows how
this is done for the gender-smoking-heart disease data:

========================================================================

options  linesize = 80 ;
footnote "~john-c/5421/n54703.003.sas &sysdate &systime" ;

data strata2 ;
  length gender $5  smoking $9  heartdis $9 ;
  input gender smoking heartdis count ;
cards ;
men    1-smoke   1-hdplus     24
men    2-nosmoke 1-hdplus     18
men    1-smoke   2-hdminus    76
men    2-nosmoke 2-hdminus    82
women  1-smoke   1-hdplus     15
women  2-nosmoke 1-hdplus     10
women  1-smoke   2-hdminus    85
women  2-nosmoke 2-hdminus    90
;
run ;

proc freq data = strata2 ;
     weight count ;
     tables gender * heartdis * smoking / chisq cmh measures ;
title1 'Proc freq used to carry out Mantel-Haenszel analysis ...' ;
run ;
endsas ;

------------------------------------------------------------------------

            Proc freq used to carry out Mantel-Haenszel analysis ...           1
                                                  19:27 Monday, January 26, 2004

                               The FREQ Procedure

                         Table 1 of heartdis by smoking
                           Controlling for gender=men

                      heartdis     smoking

                      Frequency |
                      Percent   |
                      Row Pct   |
                      Col Pct   |1-smoke |2-nosmok|  Total
                                |        |e       |
                      ----------+--------+--------+
                      1-hdplus  |     24 |     18 |     42
                                |  12.00 |   9.00 |  21.00
                                |  57.14 |  42.86 |
                                |  24.00 |  18.00 |
                      ----------+--------+--------+
                      2-hdminus |     76 |     82 |    158
                                |  38.00 |  41.00 |  79.00
                                |  48.10 |  51.90 |
                                |  76.00 |  82.00 |
                      ----------+--------+--------+
                      Total          100      100      200
                                   50.00    50.00   100.00


                 Statistics for Table 1 of heartdis by smoking
                           Controlling for gender=men

             Statistic                     DF       Value      Prob
             ------------------------------------------------------
             Chi-Square                     1      1.0850    0.2976
             Likelihood Ratio Chi-Square    1      1.0880    0.2969
             Continuity Adj. Chi-Square     1      0.7535    0.3854
             Mantel-Haenszel Chi-Square     1      1.0796    0.2988
             Phi Coefficient                       0.0737          
             Contingency Coefficient               0.0735          
             Cramer's V                            0.0737          


                              Fisher's Exact Test
                       ----------------------------------
                       Cell (1,1) Frequency (F)        24
                       Left-sided Pr <= F          0.8880
                       Right-sided Pr >= F         0.1928
                                                         
                       Table Probability (P)       0.0808
                       Two-sided Pr <= P           0.3856
 
 
 
 
 
 
 
                   ~john-c/5421/n54703.003.sas 26JAN04 19:27
            Proc freq used to carry out Mantel-Haenszel analysis ...           2
                                                  19:27 Monday, January 26, 2004

                               The FREQ Procedure
 
                 Statistics for Table 1 of heartdis by smoking
                           Controlling for gender=men

             Statistic                              Value       ASE
             ------------------------------------------------------
             Gamma                                 0.1799    0.1694
             Kendall's Tau-b                       0.0737    0.0702
             Stuart's Tau-c                        0.0600    0.0574

             Somers' D C|R                         0.0904    0.0861
             Somers' D R|C                         0.0600    0.0574

             Pearson Correlation                   0.0737    0.0702
             Spearman Correlation                  0.0737    0.0702

             Lambda Asymmetric C|R                 0.0600    0.1219
             Lambda Asymmetric R|C                 0.0000    0.0000
             Lambda Symmetric                      0.0423    0.0866

             Uncertainty Coefficient C|R           0.0039    0.0075
             Uncertainty Coefficient R|C           0.0053    0.0101
             Uncertainty Coefficient Symmetric     0.0045    0.0086


                  Estimates of the Relative Risk (Row1/Row2)
 
       Type of Study                   Value       95% Confidence Limits
       -----------------------------------------------------------------
       Case-Control (Odds Ratio)      1.4386        0.7243        2.8573
       Cohort (Col1 Risk)             1.1880        0.8731        1.6164
       Cohort (Col2 Risk)             0.8258        0.5647        1.2077

                               Sample Size = 200
 
 
                   ~john-c/5421/n54703.003.sas 26JAN04 19:27
            Proc freq used to carry out Mantel-Haenszel analysis ...           3
                                                  19:27 Monday, January 26, 2004

                               The FREQ Procedure

                         Table 2 of heartdis by smoking
                          Controlling for gender=women

                      heartdis     smoking

                      Frequency |
                      Percent   |
                      Row Pct   |
                      Col Pct   |1-smoke |2-nosmok|  Total
                                |        |e       |
                      ----------+--------+--------+
                      1-hdplus  |     15 |     10 |     25
                                |   7.50 |   5.00 |  12.50
                                |  60.00 |  40.00 |
                                |  15.00 |  10.00 |
                      ----------+--------+--------+
                      2-hdminus |     85 |     90 |    175
                                |  42.50 |  45.00 |  87.50
                                |  48.57 |  51.43 |
                                |  85.00 |  90.00 |
                      ----------+--------+--------+
                      Total          100      100      200
                                   50.00    50.00   100.00


                 Statistics for Table 2 of heartdis by smoking
                          Controlling for gender=women

             Statistic                     DF       Value      Prob
             ------------------------------------------------------
             Chi-Square                     1      1.1429    0.2850
             Likelihood Ratio Chi-Square    1      1.1497    0.2836
             Continuity Adj. Chi-Square     1      0.7314    0.3924
             Mantel-Haenszel Chi-Square     1      1.1371    0.2863
             Phi Coefficient                       0.0756          
             Contingency Coefficient               0.0754          
             Cramer's V                            0.0756          


                              Fisher's Exact Test
                       ----------------------------------
                       Cell (1,1) Frequency (F)        15
                       Left-sided Pr <= F          0.9006
                       Right-sided Pr >= F         0.1964
                                                         
                       Table Probability (P)       0.0970
                       Two-sided Pr <= P           0.3928
 
 
 
 
 
 
 
                   ~john-c/5421/n54703.003.sas 26JAN04 19:27
            Proc freq used to carry out Mantel-Haenszel analysis ...           4
                                                  19:27 Monday, January 26, 2004

                               The FREQ Procedure
 
                 Statistics for Table 2 of heartdis by smoking
                          Controlling for gender=women

             Statistic                              Value       ASE
             ------------------------------------------------------
             Gamma                                 0.2273    0.2064
             Kendall's Tau-b                       0.0756    0.0697
             Stuart's Tau-c                        0.0500    0.0466

             Somers' D C|R                         0.1143    0.1050
             Somers' D R|C                         0.0500    0.0466

             Pearson Correlation                   0.0756    0.0697
             Spearman Correlation                  0.0756    0.0697

             Lambda Asymmetric C|R                 0.0500    0.1289
             Lambda Asymmetric R|C                 0.0000    0.0000
             Lambda Symmetric                      0.0400    0.1037

             Uncertainty Coefficient C|R           0.0041    0.0077
             Uncertainty Coefficient R|C           0.0076    0.0141
             Uncertainty Coefficient Symmetric     0.0054    0.0100


                  Estimates of the Relative Risk (Row1/Row2)
 
       Type of Study                   Value       95% Confidence Limits
       -----------------------------------------------------------------
       Case-Control (Odds Ratio)      1.5882        0.6766        3.7282
       Cohort (Col1 Risk)             1.2353        0.8666        1.7609
       Cohort (Col2 Risk)             0.7778        0.4712        1.2839

                               Sample Size = 200
 
 
                   ~john-c/5421/n54703.003.sas 26JAN04 19:27
            Proc freq used to carry out Mantel-Haenszel analysis ...           5
                                                  19:27 Monday, January 26, 2004

                               The FREQ Procedure

                   Summary Statistics for heartdis by smoking
                             Controlling for gender

          Cochran-Mantel-Haenszel Statistics (Based on Table Scores)
 
        Statistic    Alternative Hypothesis    DF       Value      Prob
        ---------------------------------------------------------------
            1        Nonzero Correlation        1      2.1868    0.1392
            2        Row Mean Scores Differ     1      2.1868    0.1392
            3        General Association        1      2.1868    0.1392


               Estimates of the Common Relative Risk (Row1/Row2)
 
   Type of Study     Method                  Value     95% Confidence Limits
   -------------------------------------------------------------------------
   Case-Control      Mantel-Haenszel        1.4959       0.8765       2.5531
     (Odds Ratio)    Logit                  1.4956       0.8762       2.5530

   Cohort            Mantel-Haenszel        1.2069       0.9562       1.5232
     (Col1 Risk)     Logit                  1.2081       0.9575       1.5243

   Cohort            Mantel-Haenszel        0.8068       0.5958       1.0926
     (Col2 Risk)     Logit                  0.8079       0.5968       1.0937


                              Breslow-Day Test for
                         Homogeneity of the Odds Ratios
                         ------------------------------
                         Chi-Square              0.0314
                         DF                           1
                         Pr > ChiSq              0.8594


                            Total Sample Size = 400
 
 
                   ~john-c/5421/n54703.003.sas 26JAN04 19:27

========================================================================
Notes on printout:

     Note that the odds ratios and their 95% confidence intervals
for the individual tables are printed in the lines:

       Case-Control (Odds Ratio)      1.4386        0.7243        2.8573

       Case-Control (Odds Ratio)      1.5882        0.6766        3.7282

and the overall combined Mantel-Haenszel odds ratio is 1.4959:

   Case-Control      Mantel-Haenszel        1.4959       0.8765       2.5531
     (Odds Ratio)    Logit                  1.4956       0.8762       2.5530

     The two odds ratios for the two strata (1.4386 and 1.5882) in this case
are not very different, and each stratum's confidence interval contains the
other stratum's OR estimate.  The 'Breslow-Day test for homogeneity' has a
large p-value (0.8594), so one would not reject the hypothesis that the true
odds ratios in the two strata are the same, and the analysis based on the
combined odds ratio can be accepted.
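
     Had the Breslow-Day test rejected homogeneity, the strata would be
reported separately.  One way to do that, given here only as a sketch and
not as part of the program above, is to sort by the stratifying variable and
use a BY statement.  It assumes the 'strata2' dataset created above.

```sas
proc sort data = strata2 ;
     by gender ;
run ;

proc freq data = strata2 ;
     by gender ;                     /* one analysis per stratum */
     weight count ;
     tables heartdis * smoking / chisq relrisk ;
run ;
```

     This prints a separate 2 x 2 table, tests, and odds ratio for each
gender, with no combined estimate.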

     However, in this case, the association between smoking and heart
disease is not strong enough to provide statistically significant evidence
of an overall association.  This is shown in the overall test for
association in the line of the printout labelled 'General Association',
where the p-value is 0.1392:

        Statistic    Alternative Hypothesis    DF       Value      Prob
        ---------------------------------------------------------------
            1        Nonzero Correlation        1      2.1868    0.1392
            2        Row Mean Scores Differ     1      2.1868    0.1392
            3        General Association        1      2.1868    0.1392

------------------------------------------------------------------------

THE COL1 AND COL2 RELATIVE RISKS, AND THE ODDS RATIO

  Note that in the preceding example, the printout includes computations
of the "Cohort (Col1 Risk)" and "Cohort (Col2 Risk)" relative risks.  What
are these numbers?

  The Col1 relative risk is computed by first computing, for each row, the
proportion of that row's observations that fall in the first column.  One
then divides the row 1 proportion by the row 2 proportion.  Here is how this
is computed for the following table:

                      heartdis     smoking

                      Frequency |
                      Percent   |
                      Row Pct   |
                      Col Pct   |1-smoke |2-nosmok|  Total
                                |        |e       |
                      ----------+--------+--------+
                      1-hdplus  |     15 |     10 |     25
                                |   7.50 |   5.00 |  12.50
                                |  60.00 |  40.00 |
                                |  15.00 |  10.00 |
                      ----------+--------+--------+
                      2-hdminus |     85 |     90 |    175
                                |  42.50 |  45.00 |  87.50
                                |  48.57 |  51.43 |
                                |  85.00 |  90.00 |
                      ----------+--------+--------+
                      Total          100      100      200
                                   50.00    50.00   100.00

 Column 1 Relative Risk:

   Col 1 risk for the first row :    15/25  = 0.6000
   Col 1 risk for the second row:    85/175 = 0.4857

   Col 1 Relative Risk          :  0.6000/0.4857 = 1.235.

The Col 2 Relative Risk is computed similarly:

   Col 2 risk for the first row :    10/25  = 0.4000
   Col 2 risk for the second row:    90/175 = 0.5143

   Col 2 Relative Risk          :  0.4000/0.5143 = 0.7778.

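  The quotient of the Col1 relative risk divided by the Col2 relative risk
equals the odds ratio for this table.  This is not a coincidence: the row
totals cancel, leaving the cross-product ratio (15*90)/(10*85):

         OR = (Col1 rel risk)/(Col2 rel risk) = 1.2353/0.7778 = 1.588.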
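
  The arithmetic above can be checked in a short data step, and PROC FREQ
will print the same quantities if the relrisk option is added to the tables
statement.  The following is only a sketch: the data set names (relrisk,
onetable) and the computed variable names are made up for illustration, and
the value '2-nosmoke' is assumed from the wrapped column heading in the
printout.

```sas
/* Hand computation from the four cell counts of the table above */
data relrisk ;
     a = 15 ;  b = 10 ;     /* row 1: 1-hdplus  (smoke, nosmoke)  */
     c = 85 ;  d = 90 ;     /* row 2: 2-hdminus (smoke, nosmoke)  */
     col1_rr  = (a / (a + b)) / (c / (c + d)) ;   /* Col 1 relative risk  */
     col2_rr  = (b / (a + b)) / (d / (c + d)) ;   /* Col 2 relative risk  */
     or_ratio = col1_rr / col2_rr ;               /* quotient of the two  */
     or_cross = (a * d) / (b * c) ;               /* cross-product ratio  */
     put col1_rr= 8.4 col2_rr= 8.4 or_ratio= 8.4 or_cross= 8.4 ;
run ;

/* The same table entered as counts, analyzed by PROC FREQ */
data onetable ;
     length heartdis $9 smoking $9 ;
     input heartdis smoking count ;
cards ;
1-hdplus   1-smoke    15
1-hdplus   2-nosmoke  10
2-hdminus  1-smoke    85
2-hdminus  2-nosmoke  90
;
run ;

proc freq data = onetable ;
     weight count ;
     tables heartdis * smoking / relrisk ;   /* odds ratio, Col1 and Col2 risks */
run ;
```

  In the data step, or_ratio and or_cross should agree, which illustrates
that the quotient of the two relative risks is always the odds ratio.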

==================================================================================

PAIR-MATCHED DATA:

     Some studies are designed so that the data occur essentially as matched
pairs.  An example of this would be a clinical trial on a treatment for eye
disease, in which, for each person, one eye is randomized to receive the
experimental treatment and the other eye is randomized to standard therapy
(control).  A correct analysis for such data is computation of the McNemar
chi-square statistic from the discordant pairs.  This is discussed in the
Der-Everitt textbook (in section 3.3.5, pp. 70-72), in which the pairs were
pairs of women, one of whom was assigned to use a birth-control pill and the
other was assigned to no pill.  The outcome was occurrence of blood clots.  
Please read this example in the textbook.

  Here is another example (not real data).  It is assumed that a study
is being carried out in patients with ocular hypertension (high
pressure inside the eye) in both eyes.  Each patient is randomized
to have one eye treated with an active drug and the other eye treated with
a placebo.  At the end of the treatment period the two eyes are
evaluated to see if the ocular hypertension is resolved (that is, the
person has normal pressure inside the eye).

  The data are represented as pairs of eyes.  The drug-treated eye
can be either a success (treat = 1) or a failure (treat = 0).  Similarly
the control eye: control = 1 denotes success, control = 0 denotes
failure.  Here is the program and the data, followed by the output:

========================================================================

options linesize = 80 ;
footnote "~john-c/5421/pair.sas &sysdate &systime" ;

data pairs ;
     input  control treat  count ;

cards ;
0  0  20
0  1  30
1  0  15
1  1  25
;

proc format ;
     value outcome 0 = '0: failure'
                   1 = '1: success' ;

run ;

proc freq data = pairs ;
     weight count ;
     tables control * treat / agree ;
     format treat control outcome. ;
title1  'Use of PROC FREQ to analyze matched pairs data ... ' ;
title2  'Assume data are pairs of eyes, one treated, one control' ;
title3  'where treat = 0 means the treated eye was a failure' ;
title4  '      treat = 1 means the treated eye was a success' ;
title5  '    control = 0 means the control eye was a failure' ;
title6  ' and control = 1 means the control eye was a success' ;
title7  'The paired analysis (option agree on the tables statement' ;
title8  'causes the McNemar chi-square test to be performed.' ;
title9  'It also causes printing of the kappa statistic for agreement.' ;
run ;

------------------------------------------------------------------------

              Use of PROC FREQ to analyze matched pairs data ...               1
            Assume data are pairs of eyes, one treated, one control
              where treat = 0 means the treated eye was a failure
                    treat = 1 means the treated eye was a success
                  control = 0 means the control eye was a failure
               and control = 1 means the control eye was a success
           The paired analysis (option agree on the tables statement
              causes the McNemar chi-square test to be performed.
         It also causes printing of the kappa statistic for agreement.
                                               18:41 Wednesday, February 2, 2005

                           TABLE OF CONTROL BY TREAT

                     CONTROL     TREAT

                     Frequency  |
                     Percent    |
                     Row Pct    |
                     Col Pct    |0: failu|1: succe|  Total
                                |re      |ss      |
                     -----------+--------+--------+
                     0: failure |     20 |     30 |     50
                                |  22.22 |  33.33 |  55.56
                                |  40.00 |  60.00 |
                                |  57.14 |  54.55 |
                     -----------+--------+--------+
                     1: success |     15 |     25 |     40
                                |  16.67 |  27.78 |  44.44
                                |  37.50 |  62.50 |
                                |  42.86 |  45.45 |
                     -----------+--------+--------+
                     Total            35       55       90
                                   38.89    61.11   100.00


                    STATISTICS FOR TABLE OF CONTROL BY TREAT

                                 McNemar's Test
                                 --------------
             Statistic = 5.000        DF = 1        Prob = 0.025   


                            Simple Kappa Coefficient
                            ------------------------
                                              95% Confidence Bounds
             Kappa = 0.024     ASE = 0.100       -0.171    0.219   


             Sample Size = 90

 
                      ~john-c/5421/pair.sas 02FEB05 18:41
========================================================================

Notes on the printout:

1.  Note the use of PROC FORMAT to label the outcome status of the
    two eyes, and the corresponding use of the format statement in
    the proc freq section.

2.  Note that the printout includes both the McNemar chi-square statistic
    and the kappa statistic.  These are used for very different purposes.
    The McNemar test is used to assess whether there is a treatment
    effect.  The kappa statistic, which will not be discussed further
    here, is a measure of agreement between two methods of evaluation.

    The McNemar chi-square is computed from the discordant pairs only.
    In this example it equals (30 - 15)**2 / (30 + 15) = 5.0; it is
    compared to a chi-square distribution with 1 degree of freedom, and
    the corresponding p-value is p = 0.025, which indicates there may
    be a relationship between treatment and outcome.

    The McNemar odds ratio, which is NOT printed by proc freq,
    is the ratio of the counts of the discordant pairs - that is,


                               Number where Treated = 1, Control = 0
          McNemar OddsRatio =  -------------------------------------
                               Number where Treated = 0, Control = 1


    In this example, the McNemar Odds Ratio equals 30/15 = 2.00.

    The printout also does not include a confidence interval for the
    McNemar odds ratio.  This can be derived as follows.  First,
    find the natural logarithm of the McNemar odds ratio: in this case, it
    is log(30/15) = log(2) = 0.693.

    Then note that the standard error of log(McNemar OR) is:

              sqrt(1/b + 1/c),

    where b = number of pairs for which Treat = 1, Control = 0, and
          c = number of pairs for which Treat = 0, Control = 1.

    Then find a 95% confidence interval for log(true OR):

         log(McNemar OR) +/- 1.96 * sqrt(1/b + 1/c)

       = 0.693 +/- 1.96 * sqrt(1/30 + 1/15)

       = 0.693 +/- 1.96 * sqrt(0.1)

       = 0.693 +/- 0.620 = (0.073, 1.313).

    Finally, take the exponential of these latter two numbers to
    get a 95% confidence interval for the odds ratio itself:

       95% CI for the OR:  (1.076, 3.717).

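    Note that the estimated OR, 2.00, is contained in this interval,
    but the interval is not symmetric around the OR - that is, 2.00 is
    not in the middle of the interval.  The interval is symmetric only
    on the log scale.  It also excludes 1, which agrees with the McNemar
    p-value of 0.025.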
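
    Since proc freq does not print the McNemar odds ratio or its
    confidence interval, the steps above can be carried out in a data
    step.  This is a minimal sketch, not part of the original program;
    the data set name (mcnemar) and the variable names are made up for
    illustration.

```sas
data mcnemar ;
     b = 30 ;      /* pairs with treat = 1, control = 0 */
     c = 15 ;      /* pairs with treat = 0, control = 1 */

     chisq  = (b - c)**2 / (b + c) ;      /* McNemar chi-square, 1 df     */
     pval   = 1 - probchi(chisq, 1) ;     /* upper-tail chi-square prob.  */

     mcn_or = b / c ;                     /* McNemar odds ratio           */
     se_log = sqrt(1/b + 1/c) ;           /* std error of log(McNemar OR) */
     lower  = exp(log(mcn_or) - 1.96 * se_log) ;   /* 95% CI lower limit  */
     upper  = exp(log(mcn_or) + 1.96 * se_log) ;   /* 95% CI upper limit  */

     put chisq= 8.3 pval= 8.4 mcn_or= 8.3 lower= 8.3 upper= 8.3 ;
run ;
```

    The values written to the log should agree, up to rounding, with the
    McNemar statistic in the printout and with the hand calculation.
    Replacing b and c with the discordant-pair counts from another study
    gives the corresponding results for that study.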

========================================================================

REFERENCES:

     The paper of Mantel and Haenszel was published in 1959 in the Journal
of the National Cancer Institute (where both authors worked at the time).  
This was a rather obscure journal for a statistics paper.  The paper is
one of the most-quoted publications in statistics of all time, even though
some of the key ideas were derived previously by William Cochran.  The
reference for the M-H paper is:

  1.  Mantel N and Haenszel W (1959).  Statistical aspects of the
      analysis of data from retrospective studies of disease.  J. 
      Natl Cancer Inst. 22, 719-748.

    Cochran's earlier paper was:

  2.  Cochran WG (1954).  Some methods of strengthening the common
      [chi-square] tests.  Biometrics 10, 417-451.

    The original reference for the McNemar test is:

  3.  McNemar Q (1947).  Note on the sampling error of the
      difference between correlated proportions or percentages.
      Psychometrika, 12, 153-157.

     A good reference for both Mantel-Haenszel and McNemar (and
many other categorical data methods, including the kappa statistic) is:

  4.  Fleiss J.  (1981) Statistical Methods for Rates and Proportions,
      2nd Edition.  John Wiley & Sons, New York.

================================================================================

PROBLEM 1.  

The following table is from an article in the Amer J Epidemiology, Vol 154,
1029-1036 (Balluz L et al. (2001) Investigation of systemic lupus erythematosus
in Nogales, Arizona).


                                          Cases (%)      Controls(%)
      Demographic characteristics          (n = 19)       (n = 36)     p-value
      -----------------------------      -----------     -----------  ---------
      Married                                 53%            64%        0.52
      Family origin Mexico                   100%            89%        0.39
      Born in Mexico                          67%            75%        0.65
      Employed                                95%            89%
      Education beyond high school          31.6%           8.3%        0.02

Use proc freq to compute the p-values.  [Note: you will not get the same
answers as shown above because the authors did a *matched* analysis.  They
did not provide enough information on the matching for the reader to
reproduce their results.]



PROBLEM 2.

     The following data come from another article in Am J Epi
(Bensyl DM et al. (2001) Factors associated with pilot fatality in
work-related aircraft crashes.  Am J Epi 154: 1037-1042):

     Use of shoulder
        restraint        Fatal      Nonfatal       OR and 95% Conf Int
    ----------------    -------     --------       -------------------
           Yes             52         468           0.45 (0.25, 0.80)

            No             19          77

     Use proc freq to verify the OR and the confidence interval.



PROBLEM 3.

     The following data are from a clinical trial of a treatment for
retinopathy of prematurity.  The patients were premature babies who were at
high risk of retinal detachment.  One eye of each baby was randomized to
cryotherapy, while the other was randomized to control. The outcome data
were as follows:

      treated eye good,  control eye good :    N = 68
      treated eye good,  control eye bad  :    N = 34
      treated eye bad,   control eye good :    N =  6
      treated eye bad,   control eye bad  :    N = 27

     Use proc freq to perform a McNemar test on these data, and compute the
estimated odds ratio for treated vs control, and its 95% confidence
interval.

[Data from: Hardy RJ et al (1991) Statistical considerations in terminating
 randomization in the multicenter trial of cryotherapy for retinopathy of
 prematurity.  Contr Clin Trials 12:193-303.]


PROBLEM 4.

     These data are from Chin et al. (1961), The influence of Salk
vaccination on the epidemiologic pattern and spread of the virus in the
community. Amer J. Hyg.  73: 67-94.  The outcome variable is paralysis; the
exposure variable of interest is the Salk vaccine; the stratifying factor is
age:

----------------------------------------------------------------------------------
                                                             Paralysis
                                                         -----------------
              Age group            Salk vaccine            no         yes
             -----------          --------------         ------      -----
                 0-4                   Yes                 20          14
                                       No                  10          24

                 5-9                   Yes                 15          12
                                       No                   3          15

                10-14                  Yes                  3           2
                                       No                   3           2

                15-19                  Yes                  7           4
                                       No                   1           6

                20-39                  Yes                 12           3
                                       No                   7           5

                 40+                   Yes                  1           0
                                       No                   3           2
----------------------------------------------------------------------------------

     Use proc freq to analyze the data in this table.  The objective is to
investigate whether the Salk vaccine is a risk factor for paralysis.


n54703.002  Last update: February 1, 2005.