Thursday, June 28, 2012

A better fairness test

The fairness test in my previous post had two problems:

  1. It was not sensitive to results other than first place.
  2. It could reject the null hypothesis that the motion is "fair" simply because teams in one position had lower variance in their results.  Since it is at least folk wisdom that Opening Government teams typically have lower variance in their results, it could reject motions as unfair for reasons that are endemic to the format and have nothing to do with the particular motion.
After a bit of thought, I instead propose the following alternative:
A motion is fair iff the expected number of team points for a team in any position is 1.5.

This has the useful quality that it is sensitive to all the ranks that a team could attain, not just first place.  It also avoids mis-identifying variance issues as unfairness issues.  It seems, to me at least, that a particular position having a higher variance in its results does not ipso facto make a motion unfair.  It also seems to me a feasible standard, in that CA teams could reach it (with lots of effort, critical thinking, and a little luck).  (It is certainly more feasible than the extreme alternative, that teams in any given position must have an equal chance of coming first, second, third or fourth.)
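To make the standard concrete (the rank probabilities below are invented purely for illustration): write $p_1, p_2, p_3, p_4$ for the probabilities that a given position takes 3, 2, 1 or 0 team points, so that

$$E[\text{team points}] = 3p_1 + 2p_2 + p_3.$$

The uniform case $(0.25, 0.25, 0.25, 0.25)$ gives $0.75 + 0.50 + 0.25 = 1.5$, but so does, for example, $(0.3, 0.2, 0.2, 0.3)$: the standard constrains the mean of a position's results, not its entire distribution over ranks.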

What follows is a cookbook, so that any interested party could implement this test.  I assume a very basic knowledge of matrix algebra.  If you're not interested in the maths, feel free to skip to the end.


Cookbook

Let there be N rooms, labelled i = 1, ..., N.  Let $X_i$ be the $3 \times 1$ column vector of the team points attained by the OG, OO, and CG in room i.

Assume $X_i$ is independently and identically distributed across rooms, with mean $\mu$ and variance-covariance matrix $V$.  Define the sample mean and sample variance-covariance matrix

$$\bar{X}_N = \frac{1}{N}\sum_{i=1}^{N} X_i, \qquad \hat{V}_N = \frac{1}{N-1}\sum_{i=1}^{N} (X_i - \bar{X}_N)(X_i - \bar{X}_N)'.$$

By the Multivariate Central Limit Theorem, and an appropriate Law of Large Numbers, it follows that:

$$\sqrt{N}\,(\bar{X}_N - \mu) \xrightarrow{d} \mathcal{N}(0, V), \qquad \hat{V}_N \xrightarrow{p} V.$$
We want to test the hypothesis that teams in every position have the same expected team points.  This is equivalent to the hypothesis that

$$H_0: \mu = \mu_0 = (1.5, 1.5, 1.5)'.$$

(Of course, since the points in every room add up to six by necessity, this entails that CO also has 1.5 team points in expectation.)  Our alternative is that $\mu$ is not equal to this stacked vector of 1.5's.  Define the test statistic $z$ as follows:

$$z = N\,(\bar{X}_N - \mu_0)'\,\hat{V}_N^{-1}\,(\bar{X}_N - \mu_0).$$
Under the null hypothesis, z converges in distribution to a Chi-squared distribution with three degrees of freedom.  As before, one can simply compute z, consult the appropriate table of critical values, and check whether z exceeds the critical value for the chosen significance level.
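For anyone who would rather not do this by hand, here is a minimal sketch in Python (using numpy and scipy) of the test as reconstructed above.  The function name, the data layout (one triple of OG, OO, CG team points per room), and the simulated example data are my own choices for illustration, not part of the cookbook.

    import numpy as np
    from scipy import stats

    def motion_fairness_test(points):
        """Wald test of H0: each position earns 1.5 team points in expectation.

        points: an N x 3 array-like of team points, one row per room,
        columns ordered (OG, OO, CG).
        """
        X = np.asarray(points, dtype=float)
        N = X.shape[0]
        x_bar = X.mean(axis=0)               # sample mean per position
        V_hat = np.cov(X, rowvar=False)      # sample variance-covariance matrix
        diff = x_bar - np.full(3, 1.5)       # deviation from the null value
        z = N * diff @ np.linalg.inv(V_hat) @ diff
        p_value = stats.chi2.sf(z, df=3)     # upper tail of Chi-squared(3)
        return z, p_value

    # Illustrative use with simulated "fair" results from 24 rooms:
    rng = np.random.default_rng(0)
    ranks = rng.permuted(np.tile([3, 2, 1, 0], (24, 1)), axis=1)  # random rankings per room
    z, p = motion_fairness_test(ranks[:, :3])                     # keep the OG, OO, CG columns
    print(f"z = {z:.2f}, p = {p:.3f}")

A large z (equivalently, a small p-value) is evidence against the hypothesis that the motion is fair.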

Caveats

All the caveats in my previous post apply.  Most importantly, this is a large sample procedure, and will not (as a rule of thumb) yield valid statistical inference for competitions smaller than twenty rooms.  If, for some reason, you believe that BP debating is systematically unbalanced (regardless of motion) in some subset of rooms, then you could drop such rooms from the sample.  However, if you report multiple hypothesis tests for a single motion, using various sub-samples, it would be wise to apply a Bonferroni correction.
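As a concrete (and entirely made-up) illustration of that correction: if you report the test on, say, three sub-samples of the same motion, compare each p-value to the overall significance level divided by the number of tests.  The labels and numbers below are invented.

    # Hypothetical p-values from three sub-sample tests of one motion.
    alpha = 0.05
    p_values = {"all rooms": 0.030, "out-rounds excluded": 0.012, "top half of tab": 0.200}
    m = len(p_values)                                   # number of tests reported
    for label, p in p_values.items():
        verdict = "reject" if p < alpha / m else "fail to reject"   # Bonferroni-adjusted threshold
        print(f"{label}: p = {p:.3f} -> {verdict} the null of fairness")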

So What?

Having made your way this far, I'd now like to answer the obvious question: Why should we care?

We should care about having a rigorous test of motion fairness because it would allow us to have better discussions about motion fairness after a competition.  Sometimes CA teams set unfair motions.  Sometimes CA teams set fair motions, but get unlucky, and the results are skewed against them.  Having a scientific test for motion fairness allows us to answer the question: if the motion was fair, how improbable would it be for us to observe results at least as skewed as these?  For instance, if the null hypothesis is rejected at the 5% significance level, then we know that, had the motion been fair, there would have been less than a 5% chance of seeing results this skewed.

This means that teams can more accurately gauge whether their performance was advantaged or handicapped by the motion in question (though it goes without saying that debaters who improve don't blame the motion).  It would also help us CA teams to be honest with ourselves; it's often difficult to identify one's own mistakes, and having an objective gauge can help motion-setters acknowledge errors and improve over time.  In general, it would be great for transparency and accountability.  Indeed, it may often turn out the other way around: we may find that, even though the results look too skewed for the motion to be fair, the test (which accounts for sample size) tells us that we cannot reject the hypothesis that the motion is fair; i.e. that the data are consistent with the motion being fair.  This would, at least, help us realise the limitations of our own data, and reduce undeserved criticism.
