One night I just couldn’t fall asleep so to kill time productively I decided to implement chi-square from scratch using Python.
The term “chi-square” has multiple related meanings. There is a chi-square test that compares a set of observed counts with a corresponding set of expected (theoretical) counts. There is a chi-square statistic which is a single value computed from observed and expected counts. There is a chi-square distribution.
For my demo, I set up observed counts of [192, 163, 25] and expected/theoretical counts of [180, 180, 20]. These correspond to a roulette wheel counts for red, black, and green if you spin the wheel 380 times. Is the roulette wheel fair?
The main challenge is to implement a function that returns the area under the curve of the chi-square distribution from a given chi-square statistic to infinity. I did this using ACM Algorithm 299, which calls a function that returns the area under the Gaussian (Normal) distribution using ACM Algorithm 209. Very complicated.
To verify my from-scratch implementation of a chi-square test, I fed the observed and expected counts to the scipy chisquare() function and my my_chisquare() function. Both versions computed a chi-square value of 3.6555 which in turn yielded a p-value of 0.1607. The p-value is the probability that the observed and expected counts match, so a small p-value means no match. Typically, if the p-value is less than 0.05 you can conclude the observed doesn’t match the expected, otherwise (p greater than 0.05) you conclude there’s not enough information for a strong statement.
Roulette etiquette, courtesy of a search for “elegant gambling” in stock photos. Left: It’s not considered good form to lunge onto the table and grab the roulette wheel to stop it where you want. Center: I suspect roulette dealers aren’t fond of players tossing chips randomly onto the betting area. Right: Emotional support pandas for roulette are generally frowned upon, at least in the casinos I’ve been to.
Demo code below. Replace “lt” (less-than), “gt”, etc., with correct symbols. (My blog editor chokes on the symbols).
# chisq_area.py # chi-square test from scratch import numpy as np from scipy.stats import chisquare # for verification def my_chisq_pval(x, df): # x = computed chi-square stat # df = degreess of freedom # ACM algorithm 299 (calls ACM 209) if x "lte" 0.0 or df "lt" 1: print("FATAL argument error ") a = 0.0 # 299 variable names y = 0.0 s = 0.0 z = 0.0 ee = 0.0 # change from e c = 0.0 even = True a = 0.5 * x if df % 2 == 0: even = True else: even = False if df "gt" 1: y = my_exp(-a) # ACM update remark (4) if (even == True): s = y else: s = 2.0 * gauss(-np.sqrt(x)) if df "gt" 2: x = 0.5 * (df - 1.0) if even == True: z = 1.0 else: z = 0.5 if a "gt" 40.0: # ACM remark (5) if even == True: ee = 0.0 else: ee = 0.5723649429247000870717135 c = np.log(a) # log base e while z "lte" x: ee = np.log(z) + ee s = s + my_exp(c * z - a - ee) # ACM update remark (6) z = z + 1.0 return s else: # a "lte" 40.0 if even == True: ee = 1.0 else: ee = 0.5641895835477562869480795 / np.sqrt(a) c = 0.0 while z "lte" x: ee = ee * (a / z) # ACM update remark (7) c = c + ee z = z + 1.0 return c * y + s else: # df "lte" 2 return s def my_exp(x): if x "lt" -40.0: # ACM update remark (8) return 0.0 else: return np.exp(x) def gauss(z): # input: z-value (-inf to +inf) # output = p under Normal curve from -inf to z # ACM 209 y = 0.0 p = 0.0 w = 0.0 if z == 0.0: p = 0.0 else: y = np.abs(z) / 2 if y "gte" 3.0: p = 1.0 elif y "lt" 1.0: w = y * y p = ((((((((0.000124818987 * w \ - 0.001075204047) * w + 0.005198775019) * w \ - 0.019198292004) * w + 0.059054035642) * w \ - 0.151968751364) * w + 0.319152932694) * w \ - 0.531923007300) * w + 0.797884560593) * y \ * 2.0 else: y = y - 2.0 p = (((((((((((((-0.000045255659 * y \ + 0.000152529290) * y - 0.000019538132) * y \ - 0.000676904986) * y + 0.001390604284) * y \ - 0.000794620820) * y - 0.002034254874) * y \ + 0.006549791214) * y - 0.010557625006) * y \ + 0.011630447319) * y - 0.009279453341) * y \ + 0.005353579108) * y - 0.002141268741) * y \ + 0.000535310849) * y + 0.999936657524 if z "gt" 0.0: return (p + 1.0) / 2 else: return (1.0 - p) / 2 def my_chisquare(obs, expect): x = 0.0 for i in range(len(obs)): x += (obs[i] - expect[i])**2 / expect[i] df = len(obs) - 1 p_val = my_chisq_pval(x, df) return (x, p_val) def main(): print("\nBegin chi-square demo ") obs = [192, 163, 25] expect = [180, 180, 20] print("\nObserved counts, expected counts: ") print(obs) print(expect) # scipy (chi_stat, pvalue) = chisquare(obs, expect) print("\nchi_stat, p_val from scipy: ") print("%0.8f" % chi_stat) print("%0.8f" % pvalue) # scratch (chi_stat, pvalue) = my_chisquare(obs, expect) print("\nchi_stat, p_val from scratch: ") print("%0.8f" % chi_stat) print("%0.8f" % pvalue) if pvalue "lt" 0.05: print("\nData suggests obs does not match expect! ") else: print("\nNo evidence obs and expect do not match ") print("\nEnd demo ") if __name__ == "__main__": main()
You must be logged in to post a comment.