## Calculating a Chi-Square p-Value using Estimated Integration (Riemann Sum)

A chi-square (also called chi-squared) test compares a set of observed frequencies with a set of expected/theoretical frequencies. A chi-square test first calculates a chi-square statistic (a value like 8.0) and then uses the statistic, along with a df (degrees of freedom) value to calculate a p-value. The p-value is the area under the chi-square probability density function (p.d.f) to the right of the chi-square statistic value, and is the probability that the observed data matches the expected data.

One standard way to calculate a chi-square p-value is to use ACM Algorithm 299 which in turn uses ACM Algorithm 209 (which calculates the area under the p.d.f of a Normal/Gaussian function). On a recent trip to speak at a conference, I was sitting in an airport and wondered how hard it would be to calculate a chi-square p-value by iterative integrating the chi-square p.d.f. curve (estimating the area by creating many small rectangles and adding their areas up). The technique is sometimes called a Riemann sum.

It turns out that the p.d.f. for the chi-square distribution is defined in terms of the Gamma function. The Gamma function is a crazy complex topic by itself, but in the context of calculating chi-square p-values, you can use a special-case version of Gamma that accepts either an integer argument or an odd integer argument divided by 2, where the argument represents the chi-square degrees of freedom.

To summarize, to compute a chi-square p-value I wanted to integrate (by adding up the areas of a bunch of small rectangles) under the chi-square probability density function. To do this I needed to calculate p.d.f. values (to get the heights of the rectangles). To do that I needed to define a special case Gamma function. And, just to make things a bit more interesting, the special case Gamma function uses the Factorial function (n!) and a variant called the Double Factorial (n!!), and these two factorial functions need big-integer (arbitrarily large) arithmetic.

Well, I put it all together in a C# program. The results of calculating p-values using the Riemann sum integrate technique were just about the same as those using the ACM Algorithm 299 technique. I don’t see any immediate, significant advantage to the integrate technique. The integrate technique code is a bit shorter and a bit easier to understand than the ACM technique, but the integrate technique is much less efficient (because it has to calculate the area of about 5,000 rectangles). Bottom line: the ACM technique is almost certainly preferable in most situations but the integrate technique could possibly be useful in weird scenarios.

Code:

```using System;
using System.Numerics;

namespace ChiSquare
{
class Program
{
static void Main(string[] args)
{
Console.WriteLine("\nBegin Chi-Square p-Value demo\n");
for (double x = 2.0; x <= 9.0; x += 1.0)
{
int df = (int)(x);
double p1 = ChiSquare(x, df);
double p2 = ChiSquare(x, df, 5000);
Console.WriteLine("x = " +
x.ToString("F1") +
" df = " + df);
Console.WriteLine("Chi-square using ACM 299 +
Gauss 209 p-value  =
" + p1.ToString("F8"));
Console.WriteLine("Chi-square using p.d.f.
integration p-value   =
" + p2.ToString("F8"));
Console.WriteLine("");
}

Console.WriteLine("\nEnd demo\n");

} // Main

public static double ChiSquare(double x, int df)
{
// ACM Algorithm 299 and update ACM TOMS June 1985.
. . .

private static double Exp(double x) // ACM update
{
// helper
}

public static double Gauss(double z)
{
// ACM Algorithm #209
. . .
}

public static double ChiSquare(double x, int df, int nIntervals)
{
// 1. compute endpoints of intervals
double[] points = new double[nIntervals + 1];
points[0] = 0.0;
points[nIntervals] = x;
double width = x / nIntervals;
for (int i = 1; i < nIntervals; ++i)
points[i] = i * width;

// integrate
double area = 0.0; // from 0.0 to x
for (int i = 0; i < nIntervals; ++i)
{
double left = points[i];
double right = points[i + 1];
double mid = (left + right) / 2;
double height = ChiSqProbDenFunc(mid, df);
double a = height * width;
area += a;
}

return 1.0 - area;
}

public static BigInteger Fact(int n)
{
BigInteger result = 1;
for (int i = n; i >= 1; i -= 1)
result *= i;
return result;
}

public static BigInteger DoubleFact(int n)
{
BigInteger result = 1;
int stop = 2; // for even n
if (n % 2 != 0) stop = 1;
for (int i = n; i >= stop; i -= 2)
result *= i;
return result;
} // DoubleFact

{
// special version
if (df % 2 == 0) // even so df/2 is an integer
{
double r = (double)Fact(df / 2 - 1);
return r;
}
else // df/2 is not an int
{
double rootPi = Math.Sqrt(Math.PI);
BigInteger top = DoubleFact(df - 2);
double bottom = Math.Pow(2.0, (df - 1) / 2.0);
return rootPi * ((double)top / bottom);
}
}

public static double ChiSqProbDenFunc(double x, int df)
{
double topLeft = Math.Pow(x, (df / 2.0 - 1));
double topRight = Math.Exp(-x / 2.0);
double bottLeft = Math.Pow(2, (df / 2.0));