PROC NPAR1WAY: Non-Parametric Statistics                         n54703.002

     Nonparametric statistics are used for testing statistical hypotheses
in situations where the true distributions of the variables are not known,
or where they are known but are not close to the distributions assumed
in carrying out parametric tests.  Here by 'parametric tests' we mean
tests based on specific distributions, such as z-tests (normal
distribution), chi-squared tests (chi-square distribution), and many others.

     The following program shows the use of nonparametric statistics on
a dataset of weight gains in animals at 5 different dose levels
of a substance called 'gossypol' (a constituent of cottonseed):

----------------------------------------------------------------------------------

FILENAME GRAPH 'gsas.grf' ;

OPTIONS  LINESIZE = 80 PAGESIZE = 30 ;

GOPTIONS
         RESET = GLOBAL
         ROTATE = PORTRAIT
         FTEXT = SWISSB
         DEVICE = PSCOLOR
         GACCESS = SASGASTD
         GSFNAME = GRAPH
         GSFMODE = REPLACE
         GUNIT = PCT BORDER
         CBACK = WHITE
         HTITLE = 2 HTEXT = 1 ;

*===================================================================== ;        
footnote "~john-c/5421/gossypol.sas &sysdate &systime" ;
data gossypol ;

     input dose n ;

     do i = 1 to n ;

        input wgtgain @@ ;
        output;

     end ;

     datalines ;
     0 16
       228 229 218 216 224 208 235 229 233 219 224 220 232 200 208 232
     .04 11
       186 229 220 208 228 198 222 273 216 198 213
     .07 12
       179 193 183 180 143 204 114 188 178 134 208 196
     .10 17
       130  87 135 116 118 165 151  59 126  64  78  94 150 160 122 110 178
     .13 11
       154 130 130 118 118 104 112 134  98 100 104
     ;

symbol1 v = 'o' w = 2 h = 5 c = black ;
symbol2 v = 'o' w = 2 h = 5 c = black ;
symbol3 v = 'o' w = 2 h = 5 c = black ;
symbol4 v = 'o' w = 2 h = 5 c = black ;
symbol5 v = 'o' w = 2 h = 5 c = black ;

proc plot data = gossypol ;
     plot wgtgain * dose ;
title1 'Weight gains vs Dose Levels of Gossypol' ;
run ;

proc npar1way anova wilcoxon median edf data = gossypol ;
     class dose ;
     var wgtgain ;
title1 'PROC NPAR1WAY applied to data on weight gain vs 5 levels of gossypol.' ;
endsas;

----------------------------------------------------------------------------------

                    Weight gains vs Dose Levels of Gossypol                    1
                                               18:59 Wednesday, January 21, 2004

           Plot of WGTGAIN*DOSE.  Legend: A = 1 obs, B = 2 obs, etc.

WGTGAIN |
    300 +
        |                    A
        |
        |D
        |I                   F
    200 +C                   C              D
        |                    A              E              A
        |                                                  D              A
        |                                   B              B              C
        |                                   A              E              C
    100 +                                                  A              D
        |                                                  B
        |                                                  B
        |
        |
      0 +
        -+---------+---------+---------+---------+---------+---------+---------+
       0.00      0.02      0.04      0.06      0.08      0.10      0.12     0.14

                                          DOSE
 
                    ~john-c/5421/gossypol.sas 21JAN04 18:59

      PROC NPAR1WAY applied to data on weight gain vs 5 levels of gossypol.     2
                                               18:59 Wednesday, January 21, 2004

                       N P A R 1 W A Y  P R O C E D U R E

                   Analysis of Variance for Variable WGTGAIN
                          Classified by Variable DOSE


    DOSE          N          Mean                    Among MS     Within MS
                                                   35020.7465    627.451597
    0            16    222.187500
    0.04         11    217.363636                     F Value      Prob > F
    0.07         12    175.000000                      55.814        0.0001
    0.1          17    120.176471
    0.13         11    118.363636
                       Average Scores Were Used for Ties
 
 
                    ~john-c/5421/gossypol.sas 21JAN04 18:59

      PROC NPAR1WAY applied to data on weight gain vs 5 levels of gossypol.     3
                                               18:59 Wednesday, January 21, 2004

                       N P A R 1 W A Y  P R O C E D U R E

                Wilcoxon Scores (Rank Sums) for Variable WGTGAIN
                          Classified by Variable DOSE


                           Sum of      Expected       Std Dev          Mean
    DOSE          N        Scores      Under H0      Under H0         Score

    0            16    890.500000         544.0    67.9789655    55.6562500
    0.04         11    555.000000         374.0    59.0635883    50.4545455
    0.07         12    395.500000         408.0    61.1366221    32.9583333
    0.1          17    275.500000         578.0    69.3807412    16.2058824
    0.13         11    161.500000         374.0    59.0635883    14.6818182
                       Average Scores Were Used for Ties


             Kruskal-Wallis Test (Chi-Square Approximation)
             CHISQ =  52.666     DF =  4     Prob > CHISQ = 0.0001

 
                    ~john-c/5421/gossypol.sas 21JAN04 18:59

      PROC NPAR1WAY applied to data on weight gain vs 5 levels of gossypol.     4
                                               18:59 Wednesday, January 21, 2004

                       N P A R 1 W A Y  P R O C E D U R E

                 Median Scores (Number of Points Above Median)
                              for Variable WGTGAIN
                          Classified by Variable DOSE


                           Sum of      Expected       Std Dev          Mean
    DOSE          N        Scores      Under H0      Under H0         Score

    0            16          16.0    7.88059701    1.75790231    1.00000000
    0.04         11          11.0    5.41791045    1.52735508    1.00000000
    0.07         12           6.0    5.91044776    1.58096271    0.50000000
    0.1          17           0.0    8.37313433    1.79415153    0.00000000
    0.13         11           0.0    5.41791045    1.52735508    0.00000000
                       Average Scores Were Used for Ties


             Median 1-Way Analysis (Chi-Square Approximation)
             CHISQ =  54.176     DF =  4     Prob > CHISQ = 0.0001

 
 
                    ~john-c/5421/gossypol.sas 21JAN04 18:59

      PROC NPAR1WAY applied to data on weight gain vs 5 levels of gossypol.     5
                                               18:59 Wednesday, January 21, 2004

                       N P A R 1 W A Y  P R O C E D U R E

                  Kolmogorov-Smirnov Test for Variable WGTGAIN
                          Classified by Variable DOSE

                                                                      Deviation
                                                   EDF                from Mean
            DOSE                  N            at Maximum            at Maximum

            0                    16            0.00000000           -1.91044776
            0.04                 11            0.00000000           -1.58405960
            0.07                 12            0.33333333           -0.49979576
            0.1                  17            1.00000000            2.15386115
            0.13                 11            1.00000000            1.73256519
                               ----           -----------
                                 67            0.47761194

              Maximum Deviation Occurred at Observation   36
              Value of WGTGAIN at Maximum  178.000000

                   Kolmogorov-Smirnov Statistic (Asymptotic)
                       KS = 0.457928      KSa =  3.74830
 
 
 
 
                    ~john-c/5421/gossypol.sas 21JAN04 18:59


      PROC NPAR1WAY applied to data on weight gain vs 5 levels of gossypol.     6
                                               18:59 Wednesday, January 21, 2004

                       N P A R 1 W A Y  P R O C E D U R E

                   Cramer-von Mises Test for Variable WGTGAIN
                          Classified by Variable DOSE

                                                  Summed
                                                Deviation
            DOSE                  N             from Mean

            0                    16            2.16521023
            0.04                 11            0.91827966
            0.07                 12            0.34822684
            0.1                  17            1.49754164
            0.13                 11            1.33574457


                    Cramer-von Mises Statistic (Asymptotic)
                       CM = 0.093508      CMa =  6.26500
 
                    ~john-c/5421/gossypol.sas 21JAN04 18:59

----------------------------------------------------------------------------------

DISCUSSION OF EXAMPLE

Note the input statements:

     input dose n ;

     do i = 1 to n ;

        input wgtgain @@ ;
        output;

     end ;

The data file in this case is structured as follows: on the first line,
the dose level and the number of animals given that dose are specified:
for example, the first line is: 0  16 , meaning dose level 0 and n = 16
animals.  Then on the second line, the weight gains for each of the 16 animals 
are given.  The third line says: .04  11 , meaning the dose was .04 and
n = 11 animals got that dose.  And so on.

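Note that the second 'input' statement is :  input  wgtgain @@ ;
The two @ signs (a 'double trailing @') tell SAS to hold the current data
line, so that each pass through the 'do' loop reads the next value from the
same line instead of moving to a new line.  The loop stops the reading after
n values have been read.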

Note the 'output' statement: this means that each time a weight gain is
read in, an observation is written to the dataset.  The first 5 observations
of the dataset will look like the following (the loop counter i is also
stored in the dataset but is not shown):

     Obs    dose    n     wgtgain
    -----  ------  ---   ---------
      1      0      16      228
      2      0      16      229
      3      0      16      218
      4      0      16      216
      5      0      16      224
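
One way to check this listing is to print the first few observations.  The
following is a minimal sketch, assuming the gossypol dataset created by the
program above; the title text is only illustrative, and the 'var' statement
restricts the listing to the three columns shown:

```sas
* List the first 5 observations of the dataset built by the DATA step ;
proc print data = gossypol (obs = 5) ;
     var dose n wgtgain ;
title1 'First 5 observations of gossypol' ;
run ;
```

Leaving out the 'var' statement would list every variable in the dataset,
including the loop counter i.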

After the data were read in, PROC PLOT was called to plot the weight gains
versus the dose levels.  Many of the points overlap: as the legend states,
A marks a position holding 1 observation, B marks 2 observations, and so on.
PROC GPLOT could also be used here, and its better resolution would reduce
overprinting of the points.  The output from PROC PLOT shows rather
convincing evidence of differences in weight gain between the dose groups.
There appears to be a downward trend in weight gain as the dose is increased.
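
A PROC GPLOT version of the same plot might look like the sketch below.  It
assumes that SAS/GRAPH is available and that the GOPTIONS and SYMBOL1
statements from the program above are still in effect, so the graph is
written to the file named in the FILENAME statement:

```sas
* High-resolution version of the PROC PLOT scatterplot (needs SAS/GRAPH) ;
* Plotting symbols come from the SYMBOL1 statement defined earlier ;
proc gplot data = gossypol ;
     plot wgtgain * dose ;
title1 'Weight gains vs Dose Levels of Gossypol' ;
run ;
quit ;
```

Animals with exactly the same dose and weight gain will still be drawn on
top of one another.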

PROC NPAR1WAY is invoked next:

   proc npar1way anova wilcoxon median edf data = gossypol ;

The options used here are described as follows:

1.  anova:    Performs a standard one-way analysis of variance to test
              differences in means between the various groups.  This test is
              appropriate if measurement error of the outcome variable
              is known to have a normal distribution.

2.  Wilcoxon: This is a true nonparametric test, known also as the rank-sum
              test.  It is appropriate for comparing data from continuous
              distributions in which the various groups differ in that
              the distributions are *shifted* away from each other.  When
              more than two groups are being compared, this is also known
              as the Kruskal-Wallis test.

3.  median:   This test compares the observations to the overall median of
              the whole dataset.  It is a powerful test when the distributions
              are symmetric and have heavy tails.

4.  edf:      Here 'edf' stands for 'empirical distribution function'.  This
              option requests tests that compare the cumulative
              distributions of the groups: the Kolmogorov-Smirnov test
              and the Cramer-von Mises test.
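
The 'Expected Under H0' and 'Std Dev Under H0' columns of the Wilcoxon output
can be checked by hand.  The sketch below uses the standard rank-sum formulas
for a group of size n_i out of N observations, mean n_i(N+1)/2 and, when
there are no ties, variance n_i(N-n_i)(N+1)/12.  The variable names are
illustrative only:

```sas
* Null mean and standard deviation of the rank sum for the dose = 0 group ;
data _null_ ;
     ntotal = 67 ;    * total number of animals ;
     ngroup = 16 ;    * animals at dose 0 ;
     expected = ngroup * (ntotal + 1) / 2 ;
     sdnoties = sqrt(ngroup * (ntotal - ngroup) * (ntotal + 1) / 12) ;
     put expected= sdnoties= ;
run ;
```

The log should show expected=544, matching the output, and sdnoties=68.  The
output shows the slightly smaller value 67.979, presumably because the
variance is adjusted for tied weight gains.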


With this dataset, the first 3 tests indicate highly significant differences
between the groups (each output page shows a p-value of 0.0001), as one
might expect from examining the PROC PLOT output.  The edf output pages show
test statistics but no p-values.  The groups are sufficiently small here
that one cannot be very certain about the distribution of the outcome
variable (weight gain) within the groups, so nonparametric tests are
appropriate.

But which one should you use?  Why not just use standard one-way
analysis of variance?  Which one should you pick if the results are
different?

It must be admitted that there is no clear guidance on these questions.
In general, for data of the kind shown here, which almost certainly have
a continuous underlying distribution and for which it is plausible that the
distributions of weight gain within the groups are shifts of one another,
I would recommend the Wilcoxon test.  The median and edf tests are likely
to be less powerful.
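
A run that requests only the Wilcoxon (Kruskal-Wallis) test could be sketched
as follows.  The EXACT statement is optional and assumes a SAS release in
which PROC NPAR1WAY supports it; the number of Monte Carlo samples and the
seed are arbitrary illustrative values:

```sas
* Kruskal-Wallis (Wilcoxon scores) test only ;
proc npar1way wilcoxon data = gossypol ;
     class dose ;
     var wgtgain ;
     * Optional: Monte Carlo estimate of the exact p-value ;
     * (the n = and seed = values are arbitrary choices) ;
     exact wilcoxon / mc n = 10000 seed = 5421 ;
title1 'Kruskal-Wallis test of weight gain across 5 dose levels' ;
run ;
```

With small groups, a Monte Carlo estimate avoids relying on the chi-square
approximation shown in the output above.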

The ANOVA test is not a nonparametric test, but it is often used as though
it were one.  This is because for moderate to large sample sizes, group
means have approximately normal distributions even when the data themselves
do not.  Here, however, the sample sizes are small, and a few extreme
observations can distort the findings of the ANOVA test.  If there appear
to be outliers within the groups, one would prefer the Wilcoxon test (which
depends only on the ranks of the outcome variable, not on its actual values).

Note that the plot indicates a *trend* in the data.  None of the
tests used here provides a test for trend.  A parametric procedure,
such as PROC REG, certainly provides a test for trend, but it
is likely to be most powerful when the trend is linear, and, like
ANOVA, the underlying assumption is that measurement errors in the
outcome variable are normally and independently distributed.  There
are nonparametric tests for monotone trends, but they are not
available in PROC NPAR1WAY.
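
The two kinds of trend test could be sketched as follows.  This is only an
illustration, not part of the original program: the first step fits a
straight line with PROC REG, and the second asks PROC FREQ for the
Jonckheere-Terpstra test, one commonly used nonparametric test for monotone
trend across ordered groups:

```sas
* Parametric test for linear trend: slope of wgtgain on dose ;
proc reg data = gossypol ;
     model wgtgain = dose ;
run ;
quit ;

* Nonparametric test for monotone trend: Jonckheere-Terpstra ;
* NOPRINT suppresses the large dose-by-wgtgain crosstabulation ;
proc freq data = gossypol ;
     tables dose * wgtgain / jt noprint ;
run ;
```

In the PROC REG output, the t-test for the coefficient of dose is the test
for linear trend.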

Other test options available in PROC NPAR1WAY include:

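5.  VW:     Van der Waerden (normal scores) test: a test for location
            differences that is powerful when the underlying
            distributions are normal.

6.  Klotz:  Uses the squares of the Van der Waerden scores; a test
            for differences in scale.

7.  Savage: Powerful for comparing scale differences in exponential
            distributions or location shifts in extreme-value
            distributions.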

8.  Siegel-Tukey (option ST):  A rank test for differences in scale.

9.  Ansari-Bradley (option AB): A rank test for differences in scale.

10. Mood:  A rank test for differences in scale, based on the squared
           deviations of the ranks from their mean.
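
These options are requested in the same way as the ones used in the example.
A minimal sketch, assuming the gossypol dataset from above (ST and AB are the
option names for the Siegel-Tukey and Ansari-Bradley tests; the title text is
illustrative):

```sas
* Request the additional score tests in a single PROC NPAR1WAY step ;
proc npar1way vw savage klotz st ab mood data = gossypol ;
     class dose ;
     var wgtgain ;
title1 'Additional score tests for weight gain vs dose' ;
run ;
```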

----------------------------------------------------------------------------------

PROBLEMS

Problem 1.

  Use the data file on crime rates in Chapter 4.  Print histograms of the crime
  rates for all states combined and for Southern states and non-Southern states
  separately.

  Perform a t-test and nonparametric tests of the hypothesis that the crime
  rates in the Southern states are the same as the crime rates in the non-Southern
  states [Note: this is not actually a proper use of hypothesis testing,
  since this is not a random sample of states.  Perform the test as if
  the assignment to "Southern" and "non-Southern" had been randomly made.]
  Describe your findings.

Problem 2.

  Given the following dataset,


 OBS      GROUP        X          Y
       
   1        1      -0.58149     0.3381
   2        1       0.11909     0.0142
   3        1       0.40898     0.1673
   4        1       1.58229     2.5036
   5        1       0.25558     0.0653
   6        1      -0.50366     0.2537
   7        1       2.56224     6.5651
   8        1       0.01418     0.0002
   9        1       0.89403     0.7993
  10        1      -0.69543     0.4836
  11        1      -0.99360     0.9872
  12        1      -0.14279     0.0204
  13        1      -0.26365     0.0695
  14        1      -1.51597     2.2982
  15        1      -0.39561     0.1565
  16        1       0.62815     0.3946
  17        1       1.20440     1.4506
  18        1      -0.08493     0.0072
  19        1      -0.29970     0.0898
  20        1      -0.07620     0.0058
  21        1      -1.52330     2.3204
  22        1       3.00385     9.0231
  23        1      -0.89299     0.7974
  24        1      -1.43763     2.0668
  25        1      -0.27793     0.0772
  26        1       0.88995     0.7920
  27        1       0.96424     0.9298
  28        1      -0.80702     0.6513
  29        1      -0.33802     0.1143
  30        1      -0.73330     0.5377
  31        1      -0.28173     0.0794
  32        1      -3.60218    12.9757
  33        1      -0.50744     0.2575
  34        1       0.88039     0.7751
  35        1       1.10071     1.2116
  36        1       0.22413     0.0502
  37        1       0.16220     0.0263
  38        1       0.40509     0.1641
  39        1      -0.58761     0.3453
  40        1      -0.94528     0.8935
  41        1       1.73639     3.0151
  42        1       0.44392     0.1971
  43        1       1.80667     3.2641
  44        1       0.02912     0.0008
  45        1      -1.80752     3.2671
  46        1       0.39963     0.1597
  47        1      -1.06043     1.1245
  48        1       0.05343     0.0029
  49        1       0.21036     0.0443
  50        1       0.21532     0.0464
  51        2      -1.77325     3.1444
  52        2      -0.34354     0.1180
  53        2       0.67068     0.4498
  54        2      -1.58006     2.4966
  55        2       2.76764     7.6599
  56        2      -1.99040     3.9617
  57        2      -1.66359     2.7675
  58        2       0.92017     0.8467
  59        2      -0.25762     0.0664
  60        2       1.66384     2.7684
  61        2       1.58245     2.5041
  62        2       1.56187     2.4394
  63        2       0.37986     0.1443
  64        2       0.31203     0.0974
  65        2       2.14525     4.6021
  66        2       0.22862     0.0523
  67        2      -3.81253    14.5354
  68        2      -1.69397     2.8695
  69        2      -0.35756     0.1278
  70        2       2.33153     5.4360
  71        2       2.67642     7.1632
  72        2      -1.62174     2.6300
  73        2       0.29400     0.0864
  74        2       3.27067    10.6973
  75        2       1.37223     1.8830
  76        2      -1.04064     1.0829
  77        2      -0.89953     0.8092
  78        2      -3.75235    14.0801
  79        2      -4.03463    16.2782
  80        2      -1.76834     3.1270
  81        2      -1.15576     1.3358
  82        2       2.38133     5.6707
  83        2      -0.34702     0.1204
  84        2      -0.11424     0.0131
  85        2       0.96115     0.9238
  86        2       1.84926     3.4197
  87        2       1.06257     1.1291
  88        2      -1.69664     2.8786
  89        2       1.74271     3.0370
  90        2       0.29919     0.0895
  91        2       1.75470     3.0790
  92        2      -5.02719    25.2726
  93        2       0.72882     0.5312
  94        2      -2.10171     4.4172
  95        2      -0.83845     0.7030
  96        2      -0.29680     0.0881
  97        2       0.63033     0.3973
  98        2       0.71548     0.5119
  99        2      -0.77999     0.6084
 100        2      -3.00337     9.0202

 (1)  Create histograms for variables X and Y, separately for
      Group 1 and Group 2.

 (2)  Create scatterplots of Y versus X, separately for Group 1
      and Group 2.

 (3)  Use the Kolmogorov-Smirnov test to test whether the
      distribution of Y for Group 1 is the same as that for Group 2.

n54703.002  Last update: January 25, 2004.