Cookbook
What follows is a likelihood-ratio test for the null hypothesis that a given debate motion was fair. Here I take a particular (and contestable) definition of fairness: namely, that teams in any given position have an equal chance of coming first.
Suppose there are N rooms, numbered i = 1,...,N. Label the four positions (OG, OO, CG, CO) j=1,2,3,4.
Let b_{i,j} equal 1 if the team in room i, in position j, came first, and 0 otherwise. Let b_j be the sample average of b_{i,j} across rooms; that is, let b_j = (1/N) ∑_i b_{i,j}. Consider the statistic z, defined thus:
Under the null hypothesis that teams in a given position have an equal chance of coming first, z has an asymptotic chi-squared distribution with 3 degrees of freedom.
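As a rough illustration, here is a minimal Python sketch of computing such a statistic from the tab. It assumes the standard multinomial likelihood-ratio form, z = 2N ∑_j b_j ln(4 b_j); that formula is an assumption on my part and may differ from the exact expression intended above. The function names and the example tallies are hypothetical.

```python
import math

POSITIONS = ("OG", "OO", "CG", "CO")


def first_place_shares(firsts):
    """Return b_j: the share of rooms in which each position came first."""
    n_rooms = sum(firsts.values())  # exactly one first place per room
    return {pos: firsts[pos] / n_rooms for pos in POSITIONS}


def lr_statistic(firsts):
    """Assumed likelihood-ratio form: z = 2N * sum_j b_j * ln(4 * b_j)."""
    n_rooms = sum(firsts.values())
    shares = first_place_shares(firsts)
    total = 0.0
    for pos in POSITIONS:
        b = shares[pos]
        if b > 0:  # a position with no firsts contributes 0 (limit of b * ln b)
            total += b * math.log(4 * b)
    return 2 * n_rooms * total


# Made-up tallies for a hypothetical 60-room round.
example = {"OG": 24, "OO": 12, "CG": 12, "CO": 12}
print(lr_statistic(example))
```

When every position takes exactly a quarter of the firsts, each term is ln(1) = 0 and the statistic is zero; it grows as the shares move away from 0.25.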
What this means is that, to conduct a test at the α-significance level, one need only compute z, and then compare its magnitude to the relevant entry in a table of chi-squared critical values. If it exceeds the relevant critical value, then we reject the null hypothesis at the α-significance level.
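For those who would rather not look up a table, the comparison can be sketched in a few lines of Python. The critical values below are the usual tabulated ones for 3 degrees of freedom, and the p-value uses the closed-form upper tail that the chi-squared distribution happens to have at 3 degrees of freedom. The function names are my own, and z is assumed to have been computed already.

```python
import math

# Upper-tail critical values, chi-squared with 3 degrees of freedom.
CRITICAL_VALUES_3DF = {0.10: 6.251, 0.05: 7.815, 0.01: 11.345}


def chi2_sf_3df(z):
    """P(X > z) for X ~ chi-squared with 3 degrees of freedom."""
    if z <= 0:
        return 1.0
    tail = math.erfc(math.sqrt(z / 2))
    return tail + math.sqrt(2 * z / math.pi) * math.exp(-z / 2)


def reject_fairness(z, alpha=0.05):
    """Reject the null of a fair motion when the p-value falls below alpha."""
    return chi2_sf_3df(z) < alpha


for alpha, critical in sorted(CRITICAL_VALUES_3DF.items()):
    # The tail probability at each critical value should be close to alpha.
    print(alpha, critical, round(chi2_sf_3df(critical), 4))
```

Comparing the p-value to α is equivalent to comparing z to the critical value; the p-value just carries a little more information about how close the call was.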
Intuitively, what's going on is this: If the proportion of teams taking a first from a given position is skewed far away from 0.25, then the statistic z gets larger. This test gives us a (moderately) rigorous way to assess whether the results are too skewed to be attributed to chance; i.e. whether the data are consistent with the hypothesis that the motion was fair.
Caveats
- This test, of course, uses only one of many possible definitions of "fairness". If there's wider interest, it would be simple to construct test statistics for other definitions of fairness, so long as they can be expressed in mathematical terms. (Addendum: I've written up a test for what I think is a better definition of fairness here.)
- This test is a large-sample procedure. The distribution of z approaches a chi-squared distribution only as the number of rooms gets large. If your sample size is small (say, if you are running a ten-room competition), then the chi-squared critical values are a poor guide and using this statistic will yield invalid statistical inference. I'd hesitate to use this test at competitions with fewer than twenty rooms.
- It is inevitable in statistical testing that sometimes true null hypotheses are rejected. If you test, for instance, at the 5% significance level, then there is a 5% chance of a Type I Error. This means that, when the motion is completely fair, there is still a 5% chance the test cries foul. If every motion at a competition is completely fair, you should (on average) see one in twenty motions get rejected as unfair. You could reduce the chance of a Type I error by testing at a more demanding significance level. But that would reduce the power of the test, which makes it less able to detect motions that are actually unfair.
- Other things being equal, this test will have more power to detect unfair motions as the sample size grows. This means that, even if the CA teams of small competitions and the CA teams of large competitions are equally good at setting fair motions, we should expect to see rejections of the null hypothesis more often at large competitions. This, of course, does not mean that large competitions have worse CA teams; it merely reflects the fact that large competitions yield more data, which enables this test to be more discerning.
- The way this likelihood ratio test statistic is formulated does not take into account the possibility of serial correlation in the stochastic process for motions. It assumes that, coming into each round, every position (OG, OO, CG, CO) in every room is equally likely to be occupied by a team of a given quality. The way tab software works means that this is probably a good approximation; but I can think of exceptions to this.
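The points above about small samples, Type I errors and power can be checked by simulation. The sketch below is only illustrative: it assumes the standard multinomial likelihood-ratio form z = 2 ∑_j c_j ln(4 c_j / N), where c_j is the number of firsts from position j, and it assumes rooms are independent (so it ignores the serial-correlation concern). The function names and the "skewed" probabilities are made up.

```python
import math
import random

CRITICAL_5PCT_3DF = 7.815  # chi-squared, 3 degrees of freedom, 5% level


def lr_statistic(counts):
    """Assumed likelihood-ratio form: 2 * sum_j c_j * ln(4 * c_j / N)."""
    n = sum(counts)
    return 2 * sum(c * math.log(4 * c / n) for c in counts if c > 0)


def rejection_rate(n_rooms, probs, trials=2000, seed=0):
    """Share of simulated rounds in which the 5% test rejects fairness."""
    rng = random.Random(seed)  # seeded so that runs are repeatable
    rejections = 0
    for _ in range(trials):
        counts = [0, 0, 0, 0]
        # Draw the winning position independently in each room.
        for winner in rng.choices(range(4), weights=probs, k=n_rooms):
            counts[winner] += 1
        if lr_statistic(counts) > CRITICAL_5PCT_3DF:
            rejections += 1
    return rejections / trials


fair = [0.25, 0.25, 0.25, 0.25]
skewed = [0.40, 0.20, 0.20, 0.20]  # hypothetical unfair motion favouring OG
for n_rooms in (10, 20, 200):
    print(n_rooms, rejection_rate(n_rooms, fair), rejection_rate(n_rooms, skewed))
```

For the fair motion, the rejection rate estimates the true Type I error rate, which should sit near 5% when there are many rooms and can drift away from it when there are few. For the skewed motion, it estimates power, which should climb as the number of rooms increases.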
Very good write-up, you've given good warnings for the drawbacks of the test. Which "more ambitious test" were you originally considering?
Shen Ting: The more ambitious test was that the probability of a team in any given position taking any given rank was 1/4.
Well it depends on how "fair" you want to be. Some people will say that only the winners are remembered and hence what you've proposed is sufficiently "fair".
With regards to fairness, do you think it's worth looking at: probability the strongest team wins?