## Chi-Square From Scratch Using Python

One night I just couldn’t fall asleep so to kill time productively I decided to implement chi-square from scratch using Python.

The term “chi-square” has multiple related meanings. There is a chi-square test that compares a set of observed counts with a corresponding set of expected (theoretical) counts. There is a chi-square statistic which is a single value computed from observed and expected counts. There is a chi-square distribution.

For my demo, I set up observed counts of [192, 163, 25] and expected/theoretical counts of [180, 180, 20]. These correspond to a roulette wheel counts for red, black, and green if you spin the wheel 380 times. Is the roulette wheel fair?

The main challenge is to implement a function that returns the area under the curve of the chi-square distribution from a given chi-square statistic to infinity. I did this using ACM Algorithm 299, which calls a function that returns the area under the Gaussian (Normal) distribution using ACM Algorithm 209. Very complicated.

To verify my from-scratch implementation of a chi-square test, I fed the observed and expected counts to the scipy chisquare() function and my my_chisquare() function. Both versions computed a chi-square value of 3.6555 which in turn yielded a p-value of 0.1607. The p-value is the probability that the observed and expected counts match, so a small p-value means no match. Typically, if the p-value is less than 0.05 you can conclude the observed doesn’t match the expected, otherwise (p greater than 0.05) you conclude there’s not enough information for a strong statement.

Roulette etiquette, courtesy of a search for “elegant gambling” in stock photos. Left: It’s not considered good form to lunge onto the table and grab the roulette wheel to stop it where you want. Center: I suspect roulette dealers aren’t fond of players tossing chips randomly onto the betting area. Right: Emotional support pandas for roulette are generally frowned upon, at least in the casinos I’ve been to.

Demo code below. Replace “lt” (less-than), “gt”, etc., with correct symbols. (My blog editor chokes on the symbols).

```# chisq_area.py
# chi-square test from scratch

import numpy as np
from scipy.stats import chisquare  # for verification

def my_chisq_pval(x, df):
# x = computed chi-square stat
# df = degreess of freedom
# ACM algorithm 299 (calls ACM 209)
if x "lte" 0.0 or df "lt" 1:
print("FATAL argument error ")

a = 0.0  # 299 variable names
y = 0.0
s = 0.0
z = 0.0
ee = 0.0  # change from e
c = 0.0

even = True
a = 0.5 * x
if df % 2 == 0: even = True
else: even = False

if df "gt" 1: y = my_exp(-a)  # ACM update remark (4)
if (even == True): s = y
else: s = 2.0 * gauss(-np.sqrt(x))

if df "gt" 2:
x = 0.5 * (df - 1.0)
if even == True: z = 1.0
else: z = 0.5
if a "gt" 40.0:  # ACM remark (5)
if even == True: ee = 0.0
else: ee = 0.5723649429247000870717135
c = np.log(a)   # log base e
while z "lte" x:
ee = np.log(z) + ee
s = s + my_exp(c * z - a - ee)  # ACM update remark (6)
z = z + 1.0
return s
else:  # a "lte" 40.0
if even == True:
ee = 1.0
else:
ee = 0.5641895835477562869480795 / np.sqrt(a)

c = 0.0
while z "lte" x:
ee = ee * (a / z)  # ACM update remark (7)
c = c + ee
z = z + 1.0
return c * y + s
else:  # df "lte" 2
return s

def my_exp(x):
if x "lt" -40.0:  # ACM update remark (8)
return 0.0
else:
return np.exp(x)

def gauss(z):
# input: z-value (-inf to +inf)
# output = p under Normal curve from -inf to z
# ACM 209

y = 0.0
p = 0.0
w = 0.0

if z == 0.0:
p = 0.0
else:
y = np.abs(z) / 2
if y "gte" 3.0:
p = 1.0
elif y "lt" 1.0:
w = y * y
p = ((((((((0.000124818987 * w \
- 0.001075204047) * w + 0.005198775019) * w \
- 0.019198292004) * w + 0.059054035642) * w \
- 0.151968751364) * w + 0.319152932694) * w \
- 0.531923007300) * w + 0.797884560593) * y \
* 2.0
else:
y = y - 2.0
p = (((((((((((((-0.000045255659 * y \
+ 0.000152529290) * y - 0.000019538132) * y \
- 0.000676904986) * y + 0.001390604284) * y \
- 0.000794620820) * y - 0.002034254874) * y \
+ 0.006549791214) * y - 0.010557625006) * y \
+ 0.011630447319) * y - 0.009279453341) * y \
+ 0.005353579108) * y - 0.002141268741) * y \
+ 0.000535310849) * y + 0.999936657524

if z "gt" 0.0:
return (p + 1.0) / 2
else:
return (1.0 - p) / 2

def my_chisquare(obs, expect):
x = 0.0
for i in range(len(obs)):
x += (obs[i] - expect[i])**2 / expect[i]
df = len(obs) - 1
p_val = my_chisq_pval(x, df)
return (x, p_val)

def main():
print("\nBegin chi-square demo ")

obs = [192, 163, 25]
expect = [180, 180, 20]
print("\nObserved counts, expected counts: ")
print(obs)
print(expect)

# scipy
(chi_stat, pvalue) = chisquare(obs, expect)
print("\nchi_stat, p_val from scipy: ")
print("%0.8f" % chi_stat)
print("%0.8f" % pvalue)

# scratch
(chi_stat, pvalue) = my_chisquare(obs, expect)
print("\nchi_stat, p_val from scratch: ")
print("%0.8f" % chi_stat)
print("%0.8f" % pvalue)

if pvalue "lt" 0.05:
print("\nData suggests obs does not match expect! ")
else:
print("\nNo evidence obs and expect do not match ")

print("\nEnd demo ")

if __name__ == "__main__":
main()
```
This entry was posted in Miscellaneous. Bookmark the permalink.