Stata and R ranksum test p-values differ

Introduction

A few months ago, a clinical scholar presented a situation where we thought a non-parametric two-sample test would be suitable: the sample sizes were small, and we were not confident about the outcome distribution. The scholar pointed out to me, to my surprise, that the p-value we got from R differed from the one Stata gave. How could that happen for such a commonly used test?

Which one is right?

The data

Here are the data (ranks and group). The sample size is 11, with groups of size 5 and 6.


group,rank
1,7
1,6
1,3
1,2
0,11
0,9
0,4
1,1
0,5
0,10
1,8
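
If you prefer not to create a CSV file, the same data can be entered directly in R; the object name x matches the read.csv call used later.

# the same 11 observations, in the order listed above
x <- data.frame(
  group = c(1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1),
  rank  = c(7, 6, 3, 2, 11, 9, 4, 1, 5, 10, 8)
)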

We each ran the ranksum test on the data separately, one of us in R and the other in Stata.

Stata output

Here is what you get from Stata.


. insheet using data.csv
. ranksum rank, by(group)

Two-sample Wilcoxon rank-sum (Mann-Whitney) test

       group |      obs    rank sum    expected
-------------+---------------------------------
           0 |        5          39          30
           1 |        6          27          36
-------------+---------------------------------
    combined |       11          66          66

unadjusted variance       30.00
adjustment for ties        0.00
                     ----------
adjusted variance         30.00

Ho: rank(group==0) = rank(group==1)
             z =   1.643
    Prob > |z| =   0.1003

R output

Here's what you get from R.


> x <- read.csv("data.csv",header=T)
> wilcox.test(x$rank~x$group)

  Wilcoxon rank sum test

data:  x$rank by x$group
W = 24, p-value = 0.1255
alternative hypothesis: true location shift is not equal to 0

We have a p-value of 0.1255 from R and 0.1003 from Stata.

Which one is right?

We performed a permutation test on the Wilcoxon ranksum statistic (the two-sided version). After 100,000 permutations we got a p-value of 0.1259, which, you will notice, is much closer to R's 0.1255 than to Stata's 0.1003.

Permutation test


. permute rank W=(r(sum_obs)-r(sum_exp)), reps(100000) nodots: ranksum rank, by(group)

Monte Carlo permutation results                   Number of obs   =         11

      command:  ranksum rank, by(group)
            W:  r(sum_obs)-r(sum_exp)
  permute var:  rank

------------------------------------------------------------------------------
T            |     T(obs)       c       n   p=c/n   SE(p) [95% Conf. Interval]
-------------+----------------------------------------------------------------
           W |          9   12586  1.0e+05 0.1259  0.0010  .1238101   .1279317
------------------------------------------------------------------------------
Note:  confidence interval is with respect to p=c/n.
Note:  c = #{|T| >= |T(obs)|}
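
The same check can be done in R. Here is a minimal sketch, assuming the data frame x read in above; the number of permutations matches the Stata run, and the seed is arbitrary.

set.seed(1)                                           # arbitrary seed
obs <- sum(x$rank[x$group == 0]) - 5 * (11 + 1) / 2   # observed rank sum minus its expectation (39 - 30 = 9)
perm <- replicate(1e5, {
  g <- sample(x$group)                                # randomly relabel the groups
  sum(x$rank[g == 0]) - 5 * (11 + 1) / 2
})
mean(abs(perm) >= abs(obs))                           # two-sided permutation p-value, about 0.126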

How does Stata calculate the p-value?

This was still not completely satisfying: it is puzzling that Stata would report a p-value that seems off. Reading the documentation, it appears that Stata uses the large-sample (normal) approximation for the ranksum p-value, although that is not clearly stated. We can check this by asking R for the same approximation.


> wilcox.test(x$rank~x$group,exact=FALSE,correct=FALSE)

        Wilcoxon rank sum test

data:  x$rank by x$group
W = 24, p-value = 0.1003
alternative hypothesis: true location shift is not equal to 0

When we turn off both the exact p-value calculation and the continuity correction in R, we get a p-value that matches Stata's.
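
Stata's number is also easy to reproduce by hand from its own output: the standard rank-sum formulas give the expected rank sum of 30 and variance of 30 shown in the Stata table. A quick sketch in R:

# normal approximation with no continuity correction, as in Stata's ranksum
W_obs <- 39                        # observed rank sum in group 0
W_exp <- 5 * (11 + 1) / 2          # expected rank sum: n0*(N + 1)/2 = 30
V     <- 5 * 6 * (11 + 1) / 12     # variance with no ties: n0*n1*(N + 1)/12 = 30
z     <- (W_obs - W_exp) / sqrt(V) # 1.643
2 * pnorm(-abs(z))                 # 0.1003, matching Stata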

Why does Stata use the normal approximation?

I can only speculate. Interestingly, when I looked at several textbooks in my possession, including classics such as Cox and Hinkley (1979) and Lehmann (1986), I found that they all present the normal approximation for calculating p-values. That made sense when the books were written, when computing was expensive (or had to be done by hand). Today, however, I don't think it does: the ranksum test is most useful when sample sizes are small, and that is exactly when the exact p-value is most needed.

Notes on the exact p-value calculation

This is an interesting problem for statistics graduate students to consider, as it involves both theoretical statistics and computation. I found a recursion-based calculation of the tail probabilities in Rohatgi (1976). The recursion relation appears in the Mann and Whitney (1947) paper, and it is the method that R uses to calculate the exact p-value. It works fine for small sample sizes, which is when exact p-values are most needed. When both groups are large, however, the recursion becomes too slow. For some applications, such as high-dimensional data settings, numerical accuracy and speed also matter: roundoff errors from the recursion can accumulate and affect small p-values. See Nagarajan and Keich (2009) for a discussion of these issues and a proposed method for such situations.
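
For the curious, here is a small sketch in R of that recursion, counting the arrangements that give each value of the U statistic. This naive version has no memoization, so it is only practical for small samples like ours.

# C(u; m, n) = number of arrangements of m X's and n Y's with U = u,
# using the recursion C(u; m, n) = C(u; m, n - 1) + C(u - n; m - 1, n)
count_u <- function(u, m, n) {
  if (u < 0) return(0)
  if (m == 0 || n == 0) return(as.numeric(u == 0))
  count_u(u, m, n - 1) + count_u(u - n, m - 1, n)
}

exact_p <- function(u_obs, m, n) {
  u <- 0:(m * n)
  probs <- sapply(u, count_u, m = m, n = n) / choose(m + n, m)
  mu <- m * n / 2
  sum(probs[abs(u - mu) >= abs(u_obs - mu)])  # two-sided tail probability
}

exact_p(24, 5, 6)   # 0.1255, matching wilcox.test() and ranksumex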

Recommendations for Stata users

Until Stata fixes this issue, the easiest solution is to install the ranksumex package (Harris and Hardin, 2013) from the Stata Journal, as shown below. The paper appeared only last year, so I assume Stata is in the process of fixing the problem.

The other options are to use the permutation test shown above, or simply to use R.


. search ranksum
. net sj SJ13-2
. net describe st0297
. net install st0297
. help ranksumex
. ranksumex rank, by(group)

Two-sample Wilcoxon rank-sum (Mann-Whitney) test

       group |      obs    rank sum    expected
-------------+---------------------------------
           0 |        5          39          30
           1 |        6          27          36
-------------+---------------------------------
    combined |       11          66          66

Exact statistics
Ho: rank(group==0) = rank(group==1)
    Prob <=        21 = 0.0628
    Prob >=        39 = 0.0628
    Two-sided p-value = 0.1255

Thanks

To Anne Schafer for bringing this issue to my attention and to Karl Broman for helpful discussion.

References

  • Cox DR, Hinkley DV (1979) Theoretical statistics. Chapman and Hall.
  • Lehmann EL (1986) Testing statistical hypotheses. John Wiley and Sons.
  • Rohatgi VK (1976) An introduction to probability theory and mathematical statistics. John Wiley and Sons.
  • Mann HB, Whitney DR (1947) On a test of whether one of two random variables is stochastically larger than the other. Annals of Mathematical Statistics 18(1):50-60.
  • Nagarajan N, Keich U (2009) Reliability and efficiency of algorithms for computing the significance of the Mann-Whitney test. Computational Statistics 24(4):605-622.
  • Wilcoxon F (1945) Individual comparisons by ranking methods. Biometrics Bulletin 1(6):80-83.
  • Harris T, Hardin JW (2013) Exact Wilcoxon signed-rank and Wilcoxon Mann-Whitney ranksum tests. Stata Journal 13(2):337-343.