A chi-square (also called chi-squar*ed*) test compares a set of observed frequencies with a set of expected/theoretical frequencies. A chi-square test first calculates a chi-square statistic (a value like 8.0) and then uses the statistic, along with a df (degrees of freedom) value to calculate a p-value. The p-value is the area under the chi-square probability density function (p.d.f) to the right of the chi-square statistic value, and is the probability that the observed data matches the expected data.

One standard way to calculate a chi-square p-value is to use ACM Algorithm 299 which in turn uses ACM Algorithm 209 (which calculates the area under the p.d.f of a Normal/Gaussian function). On a recent trip to speak at a conference, I was sitting in an airport and wondered how hard it would be to calculate a chi-square p-value by iterative integrating the chi-square p.d.f. curve (estimating the area by creating many small rectangles and adding their areas up). The technique is sometimes called a Riemann sum.

It turns out that the p.d.f. for the chi-square distribution is defined in terms of the Gamma function. The Gamma function is a crazy complex topic by itself, but in the context of calculating chi-square p-values, you can use a special-case version of Gamma that accepts either an integer argument or an odd integer argument divided by 2, where the argument represents the chi-square degrees of freedom.

To summarize, to compute a chi-square p-value I wanted to integrate (by adding up the areas of a bunch of small rectangles) under the chi-square probability density function. To do this I needed to calculate p.d.f. values (to get the heights of the rectangles). To do that I needed to define a special case Gamma function. And, just to make things a bit more interesting, the special case Gamma function uses the Factorial function (n!) and a variant called the Double Factorial (n!!), and these two factorial functions need big-integer (arbitrarily large) arithmetic.

Well, I put it all together in a C# program. The results of calculating p-values using the Riemann sum integrate technique were just about the same as those using the ACM Algorithm 299 technique. I don’t see any immediate, significant advantage to the integrate technique. The integrate technique code is a bit shorter and a bit easier to understand than the ACM technique, but the integrate technique is much less efficient (because it has to calculate the area of about 5,000 rectangles). Bottom line: the ACM technique is almost certainly preferable in most situations but the integrate technique could possibly be useful in weird scenarios.

Code:

using System; using System.Numerics; namespace ChiSquare { class Program { static void Main(string[] args) { Console.WriteLine("\nBegin Chi-Square p-Value demo\n"); for (double x = 2.0; x <= 9.0; x += 1.0) { int df = (int)(x); double p1 = ChiSquare(x, df); double p2 = ChiSquare(x, df, 5000); Console.WriteLine("x = " + x.ToString("F1") + " df = " + df); Console.WriteLine("Chi-square using ACM 299 + Gauss 209 p-value = " + p1.ToString("F8")); Console.WriteLine("Chi-square using p.d.f. integration p-value = " + p2.ToString("F8")); Console.WriteLine(""); } Console.WriteLine("\nEnd demo\n"); Console.ReadLine(); } // Main public static double ChiSquare(double x, int df) { // ACM Algorithm 299 and update ACM TOMS June 1985. . . . private static double Exp(double x) // ACM update { // helper } public static double Gauss(double z) { // ACM Algorithm #209 . . . } public static double ChiSquare(double x, int df, int nIntervals) { // 1. compute endpoints of intervals double[] points = new double[nIntervals + 1]; points[0] = 0.0; points[nIntervals] = x; double width = x / nIntervals; for (int i = 1; i < nIntervals; ++i) points[i] = i * width; // integrate double area = 0.0; // from 0.0 to x for (int i = 0; i < nIntervals; ++i) { double left = points[i]; double right = points[i + 1]; double mid = (left + right) / 2; double height = ChiSqProbDenFunc(mid, df); double a = height * width; area += a; } return 1.0 - area; } public static BigInteger Fact(int n) { BigInteger result = 1; for (int i = n; i >= 1; i -= 1) result *= i; return result; } public static BigInteger DoubleFact(int n) { BigInteger result = 1; int stop = 2; // for even n if (n % 2 != 0) stop = 1; for (int i = n; i >= stop; i -= 2) result *= i; return result; } // DoubleFact public static double GammaDfOverTwo(int df) { // special version if (df % 2 == 0) // even so df/2 is an integer { double r = (double)Fact(df / 2 - 1); return r; } else // df/2 is not an int { double rootPi = Math.Sqrt(Math.PI); BigInteger top = DoubleFact(df - 2); double bottom = Math.Pow(2.0, (df - 1) / 2.0); return rootPi * ((double)top / bottom); } } public static double ChiSqProbDenFunc(double x, int df) { double topLeft = Math.Pow(x, (df / 2.0 - 1)); double topRight = Math.Exp(-x / 2.0); double bottLeft = Math.Pow(2, (df / 2.0)); double bottRight = GammaDfOverTwo(df); return (topLeft * topRight) / (bottLeft * bottRight); } } // Program } // ns