The Area under the F-Distribution using C#

In statistics, there is a test called ANOVA, which stands for Analysis of Variance. The idea is that if you have three or more populations of data (for example math ability of high school students in three different schools) and you only have samples from each (say test scores from 30 students in each school because measuring the ability of all students isn’t feasible), and you want to infer if the true average score of the three schools is the same or not.

You take the 30 scores from each school and compute what’s called an F statistic. Then you calculate the area under a function called the F-distribution from 0.0 to the F statistic, then take 1 minus that area. The result is the probability that the three population averages are the same.

The hard part is calculating the area under the F-distribution. It’s one of the most difficult problems in statistics. Stats libraries that compute the area under F typically have many hundreds of lines of extremely complex code. Examples include the R language (which actually calls the Cephes library) and the GNU scientific library code.

AreaUnderFDistribution

Just for fun, I thought I’d see if I could write some C# code to approximate the area under F. Even a crude approximation turned out to be extremely tricky. I called my main method PF() to mimic the R language pf() function. The PF() method calls method BetaInc() that approximates the incomplete beta function. The BetaInc() method calls methods BetaIncCf() and LogGamma() that approximate the incomplete beta using a continued fraction and approximate the log of gamma using the Lanczos algorithm.

If this sounds complicated, it is. But this approach is vastly less complex than a rigorous approach — if you don’t believe me just take a look at the source code for the R pf() function.

Anyway, I compared my C# approximation results with the R pf() function results. The C# results were the same as the R results to 8 decimals on 27 test points (I’d really need to do many hundreds of additional test cases to have confidence in my approximation).

using System;
namespace FDistribution
{
  class Program
  {
    static void Main(string[] args)
    {
      Console.WriteLine("\n Begin F-distribution \n");

      // computed using R pf() function
      double[] r_results = new double[] { 0.06345103,
        0.50000000,0.73227953,
        0.07334765,0.60899778,
        0.88863285,0.07866006,
        0.67074342,0.96309516,
        0.00212840,0.39100222,
        0.68503764,0.00166753,
        0.50000000,0.89044898,
        0.00141832,0.58674809,
        0.99048966,0.00000000,
        0.32925658,0.65952651,
        0.00000000,0.41325191,
        0.89514489,0.00000000,
        0.50000000,0.99964824 };

      int[] df1 = new int[] { 1, 3, 20};
      int[] df2 = new int[] { 1, 3, 20};
      double[] ff = new double[] { 0.01, 1.0, 5.0 };


      Console.WriteLine(" R result      C# result");
      Console.WriteLine("========================");
      int ctr = 0;
      for (int i = 0; i < df1.Length; ++i)
      {
        for (int j = 0; j < df2.Length; ++j)
        {
          for (int k = 0; k < ff.Length; ++k)
          {
            double p = PF(df1[i], df2[j], ff[k]);
            Console.WriteLine(" " +
              r_results[ctr++].ToString("F8") +
              "    " + p.ToString("F8"));
          }
        } // j
      } // i

      Console.WriteLine("\n End \n");
      Console.ReadLine();

    } // Main

    
    static double LogGamma(double x)
    {
      // Log of Gamma from Lanczos with g=5, n=6/7
      double[] coef = new double[6] { 76.18009172947146,
        -86.50532032941677, 24.01409824083091,
        -1.231739572450155, 0.1208650973866179E-2,
        -0.5395239384953E-5 };
      double LogSqrtTwoPi = 0.91893853320467274178;
      double denom = x + 1;
      double y = x + 5.5;
      double series = 1.000000000190015;
      for (int i = 0; i < 6; ++i)
      {
        series += coef[i] / denom;
        denom += 1.0;
      }
      return (LogSqrtTwoPi + (x + 0.5) * Math.Log(y) -
      y + Math.Log(series / x));
    }


    static double BetaInc(double a, double b, 
      double x)
    {
      // Incomplete Beta function
      // A & S 6.6.2 and 26.5.8
      double bt;
      if (x == 0.0 || x == 1.0)
        bt = 0.0;
      else
        bt = Math.Exp(LogGamma(a + b) - LogGamma(a) -
          LogGamma(b) + a * Math.Log(x) + b *
          Math.Log(1.0 - x));

      if (x < (a + 1.0) / (a + b + 2.0))
        return bt * BetaIncCf(a, b, x) / a;
      else
        return 1.0 - bt * BetaIncCf(b, a, 1.0 - x) / b;
    }

    static double BetaIncCf(double a, double b,
      double x)
    {
      // Approximate Incomplete Beta computed by
      // continued fraction
      // A & S 26.5.8 
      int max_it = 100;
      double epsilon = 3.0e-7;
      double small = 1.0e-30;

      int m2; // 2*m
      double aa, del;

      double qab = a + b;
      double qap = a + 1.0;
      double qam = a - 1.0;
      double c = 1.0;
      double d = 1.0 - qab * x / qap;
      if (Math.Abs(d) < small) d = small;
      d = 1.0 / d;
      double result = d;

      int m;
      for (m = 1; m <= max_it; ++m)
      {
        m2 = 2 * m;
        aa = m * (b - m) * x / ((qam + m2) *
          (a + m2));
        d = 1.0 + aa * d;
        if (Math.Abs(d) < small) d = small;
        c = 1.0 + aa / c;
        if (Math.Abs(c) < small) c = small;
        d = 1.0 / d;
        result *= d * c;
        aa = -(a + m) * (qab + m) * x / ((a + m2) *
          (qap + m2));
        d = 1.0 + aa * d;
        if (Math.Abs(d) < small) d = small;
        c = 1.0 + aa / c;
        if (Math.Abs(c) < small) c = small;
        d = 1.0 / d;
        del = d * c;
        result *= del;
        if (Math.Abs(del - 1.0) < epsilon) break;
      }
        if (m > max_it)
          throw new Exception("BetaIncCf() failure ");
      return result;
    } // BetaCf

    static double PF(double a, double b, double x)
    {
      // approximate lower tail of F-dist
      // (area from 0.0 to x)
      // equivalent to the R pf() function
      // only accurate to about 3 decimals
      double z = (a * x) / (a * x + b);
      return   BetaInc(a / 2, b / 2, z);
    }
  } // Program
} // ns

Advertisements
This entry was posted in Machine Learning. Bookmark the permalink.