Calculating a Chi-Square p-Value using Estimated Integration (Riemann Sum)

A chi-square (also called chi-squared) test compares a set of observed frequencies with a set of expected/theoretical frequencies. A chi-square test first calculates a chi-square statistic (a value like 8.0) and then uses the statistic, along with a df (degrees of freedom) value to calculate a p-value. The p-value is the area under the chi-square probability density function (p.d.f) to the right of the chi-square statistic value, and is the probability that the observed data matches the expected data.

ChiSquareByIntegration2

One standard way to calculate a chi-square p-value is to use ACM Algorithm 299 which in turn uses ACM Algorithm 209 (which calculates the area under the p.d.f of a Normal/Gaussian function). On a recent trip to speak at a conference, I was sitting in an airport and wondered how hard it would be to calculate a chi-square p-value by iterative integrating the chi-square p.d.f. curve (estimating the area by creating many small rectangles and adding their areas up). The technique is sometimes called a Riemann sum.

It turns out that the p.d.f. for the chi-square distribution is defined in terms of the Gamma function. The Gamma function is a crazy complex topic by itself, but in the context of calculating chi-square p-values, you can use a special-case version of Gamma that accepts either an integer argument or an odd integer argument divided by 2, where the argument represents the chi-square degrees of freedom.

To summarize, to compute a chi-square p-value I wanted to integrate (by adding up the areas of a bunch of small rectangles) under the chi-square probability density function. To do this I needed to calculate p.d.f. values (to get the heights of the rectangles). To do that I needed to define a special case Gamma function. And, just to make things a bit more interesting, the special case Gamma function uses the Factorial function (n!) and a variant called the Double Factorial (n!!), and these two factorial functions need big-integer (arbitrarily large) arithmetic.

Well, I put it all together in a C# program. The results of calculating p-values using the Riemann sum integrate technique were just about the same as those using the ACM Algorithm 299 technique. I don’t see any immediate, significant advantage to the integrate technique. The integrate technique code is a bit shorter and a bit easier to understand than the ACM technique, but the integrate technique is much less efficient (because it has to calculate the area of about 5,000 rectangles). Bottom line: the ACM technique is almost certainly preferable in most situations but the integrate technique could possibly be useful in weird scenarios.

Code:

using System;
using System.Numerics;

namespace ChiSquare
{
  class Program
  {
    static void Main(string[] args)
    {
      Console.WriteLine("\nBegin Chi-Square p-Value demo\n");
      for (double x = 2.0; x <= 9.0; x += 1.0)
      {
        int df = (int)(x);
        double p1 = ChiSquare(x, df);
        double p2 = ChiSquare(x, df, 5000);
        Console.WriteLine("x = " +
          x.ToString("F1") +
          " df = " + df);
        Console.WriteLine("Chi-square using ACM 299 +
          Gauss 209 p-value  =
          " + p1.ToString("F8"));
        Console.WriteLine("Chi-square using p.d.f. 
           integration p-value   =
           " + p2.ToString("F8"));
        Console.WriteLine("");
      }

      Console.WriteLine("\nEnd demo\n");
      Console.ReadLine(); 

    } // Main

    public static double ChiSquare(double x, int df)
    {
      // ACM Algorithm 299 and update ACM TOMS June 1985.
      . . . 

    private static double Exp(double x) // ACM update 
    {
      // helper 
    }

    public static double Gauss(double z)
    {
      // ACM Algorithm #209
      . . .
    }

    public static double ChiSquare(double x, int df, int nIntervals)
    {
      // 1. compute endpoints of intervals
      double[] points = new double[nIntervals + 1];
      points[0] = 0.0;
      points[nIntervals] = x;
      double width = x / nIntervals;
      for (int i = 1; i < nIntervals; ++i)
        points[i] = i * width;

      // integrate
      double area = 0.0; // from 0.0 to x
      for (int i = 0; i < nIntervals; ++i)
      {
        double left = points[i];
        double right = points[i + 1];
        double mid = (left + right) / 2;
        double height = ChiSqProbDenFunc(mid, df);
        double a = height * width;
        area += a;
      }

      return 1.0 - area;
    }

    public static BigInteger Fact(int n)
    {
      BigInteger result = 1;
      for (int i = n; i >= 1; i -= 1)
        result *= i;
      return result;
    }

    public static BigInteger DoubleFact(int n)
    {
      BigInteger result = 1;
      int stop = 2; // for even n
      if (n % 2 != 0) stop = 1;
      for (int i = n; i >= stop; i -= 2)
        result *= i;
      return result;
    } // DoubleFact

    public static double GammaDfOverTwo(int df)
    {
      // special version 
      if (df % 2 == 0) // even so df/2 is an integer
      {
        double r = (double)Fact(df / 2 - 1);
        return r;
      }
      else // df/2 is not an int
      {
        double rootPi = Math.Sqrt(Math.PI);
        BigInteger top = DoubleFact(df - 2);
        double bottom = Math.Pow(2.0, (df - 1) / 2.0);
        return rootPi * ((double)top / bottom);
      }
    }

    public static double ChiSqProbDenFunc(double x, int df)
    {
      double topLeft = Math.Pow(x, (df / 2.0 - 1));
      double topRight = Math.Exp(-x / 2.0);
      double bottLeft = Math.Pow(2, (df / 2.0));
      double bottRight = GammaDfOverTwo(df);
      return (topLeft * topRight) / (bottLeft * bottRight);
    }

  } // Program

} // ns
Advertisements
This entry was posted in Machine Learning. Bookmark the permalink.