Scott’s pi is one of many classical statistics metrics that can be used to measure how well two raters agree when they rate a set of items. Note that assigning a rating is not the same as ranking a set of items from best to worst. Scott’s pi, like other inter-rater reliability metrics, is used for a very specific problem scenario. I’ll explain by example.

Suppose you have two raters (or judges, or “coders” in classical stats terminology) who rate the quality of life in the 50 states of the U.S. as excellent, good, fair, poor. Your raw data might look like:

    # state      rater1     rater2
    # -------------------------------
    Alabama      fair       good
    Alaska       poor       good
    . . .
    Wisconsin    excellent  excellent
    Wyoming      good       fair

The first line of data means rater1 judged Alabama as fair, and rater2 judged Alabama as good. Scott’s pi would be a perfect 1.000 if both raters agreed exactly on all 50 states. A Scott’s pi value close to 0.000 means the raters agreed about as often as you’d expect by chance, and a negative value means they agreed less often than that.

To summarize, Scott’s pi is applicable if you have exactly two raters, and a bunch of items that are placed into one of a few discrete categories (“nominal data”) by the raters.

I hadn’t really looked at Scott’s pi since my days as a college professor so I refreshed my memory of Scott’s pi by working an example in Excel. The top matrix holds the raw ratings. For example, the 2 in the first row means that there were 2 states where Rater1 assigned Fair and Rater2 assigned Excellent. There are 50 pairs of ratings, which means there were 50 * 2 = 100 decisions made.
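The tallying step can be sketched in a few lines of Python. This is a minimal illustration that uses only the four sample rows shown earlier (the other 46 states are elided in the listing), and the variable names `ratings`, `pair_counts`, and `agreements` are my own choices for the sketch.

```python
from collections import Counter

# Only the four rows shown in the sample listing; the rest are elided there.
ratings = [
    ("Alabama", "fair", "good"),
    ("Alaska", "poor", "good"),
    ("Wisconsin", "excellent", "excellent"),
    ("Wyoming", "good", "fair"),
]

# Each key is a (rater1 category, rater2 category) pair; each value is
# the number of items that received that pair of ratings.
pair_counts = Counter((r1, r2) for _, r1, r2 in ratings)

# Pairs where both raters chose the same category are the diagonal cells.
agreements = sum(n for (r1, r2), n in pair_counts.items() if r1 == r2)

print(pair_counts[("fair", "good")])  # 1 (Alabama)
print(agreements)                     # 1 (Wisconsin)
```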

Notice that the entries on the diagonal are the number of times that Rater1 and Rater2 agreed. If there were perfect agreement, all the cells off the diagonal would be 0. The P(observed) is the proportion of agreements that actually happened. It’s calculated as the sum of the values on the diagonal of the raw data, divided by the number of data items (50). For the example, P(observed) = (4 + 6 + 3 + 5) / 50 = 18 / 50 = 0.36. Put another way, the two raters agreed on 36% of the items.
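In Python, the same arithmetic is a one-liner. This sketch plugs in the diagonal counts from the example; the variable names are just illustrative.

```python
# Diagonal counts from the worked example: items where both raters agreed.
diagonal = [4, 6, 3, 5]
n_items = 50  # number of rated items (states)

p_observed = sum(diagonal) / n_items
print(p_observed)  # 0.36
```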

The bottom matrix is used to calculate P(expected), which is the proportion of agreements you’d expect if ratings were random.

The first column (10, 17, 12, 11) holds the totals for each category assigned by Rater1. The second column (11, 16, 13, 10) holds the totals for Rater2. The joint proportion (JP) for a category is the sum of the rater totals divided by the total number of decisions made (100). For example, the JP for the Excellent category is (10 + 11) / 100 = 0.21. The fourth column holds squared JP values.

The P(expected) = 0.2596 and it’s calculated as the sum of the squared JP values.
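Here is a small sketch of the P(expected) calculation using the category totals from the example. The first entry in each list is the Excellent category; the variable names are assumptions made for the illustration.

```python
# Per-category totals from the worked example (first entry is Excellent).
rater1_totals = [10, 17, 12, 11]
rater2_totals = [11, 16, 13, 10]

# Two raters times 50 items gives 100 decisions in total.
n_decisions = sum(rater1_totals) + sum(rater2_totals)

# Joint proportion per category: pooled count over total decisions.
joint_props = [(a + b) / n_decisions
               for a, b in zip(rater1_totals, rater2_totals)]

# P(expected) is the sum of the squared joint proportions.
p_expected = sum(jp ** 2 for jp in joint_props)

print(round(joint_props[0], 2))  # 0.21
print(round(p_expected, 4))      # 0.2596
```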

Scott’s pi value compares P(observed) and P(expected) like so:

    pi = [P(observed) - P(expected)] / [1 - P(expected)]
       = (0.36 - 0.2596) / (1 - 0.2596)
       = 0.1004 / 0.7404
       = 0.136

The calculation is not obvious at first, but it makes sense if you think about it for a bit. Notice that if there is perfect agreement between the two raters, P(observed) will be 1.00 and no matter what P(expected) is, pi = (1.00 – any) / (1 – any) = 1.000.
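The formula and its edge cases are easy to check in code. This is a minimal sketch; the function name `scott_pi` and the guard against P(expected) = 1 are my own additions for illustration.

```python
def scott_pi(p_observed: float, p_expected: float) -> float:
    """Scott's pi from observed and expected agreement proportions."""
    if p_expected == 1.0:
        # Every decision fell into one category, so the ratio is 0 / 0.
        raise ValueError("pi is undefined when P(expected) is 1")
    return (p_observed - p_expected) / (1.0 - p_expected)

print(round(scott_pi(0.36, 0.2596), 3))  # 0.136, the worked example
print(scott_pi(1.0, 0.2596))             # 1.0, perfect agreement
print(scott_pi(0.2596, 0.2596))          # 0.0, agreement at chance level
```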

I found an online inter-rater reliability calculator that does Scott’s pi so I used it to verify my Excel example. The hardest part about using the online calculator was setting up the data file in the correct format. I had to encode Excellent = 4, Good = 3, and so on.

Anyway, good fun. When I get some free time, maybe I’ll code up an implementation using Python. It won’t be difficult — if you can work a problem in Excel, you can almost always translate to a Python program quite easily.
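For what it’s worth, a rough sketch of such a Python implementation might look like the following. It works directly from two lists of labels rather than from a summary matrix. The function name `scott_pi_from_labels` and the four-item toy data are made up for the illustration and are not the ratings from the Excel example.

```python
from collections import Counter

def scott_pi_from_labels(rater1, rater2):
    """Scott's pi for two equal-length sequences of category labels."""
    if len(rater1) != len(rater2) or len(rater1) == 0:
        raise ValueError("need two non-empty rating lists of equal length")
    n = len(rater1)

    # Proportion of items on which the two raters agree.
    p_obs = sum(a == b for a, b in zip(rater1, rater2)) / n

    # Pool both raters' decisions (2 * n in total), then sum the
    # squared joint proportion of each category.
    pooled = Counter(rater1) + Counter(rater2)
    p_exp = sum((count / (2 * n)) ** 2 for count in pooled.values())

    if p_exp == 1.0:
        raise ValueError("pi is undefined when only one category is used")
    return (p_obs - p_exp) / (1.0 - p_exp)

# Hypothetical toy data, not the 50-state example.
r1 = ["good", "good", "fair", "poor"]
r2 = ["good", "fair", "fair", "poor"]
print(round(scott_pi_from_labels(r1, r2), 3))  # 0.619
print(scott_pi_from_labels(r1, r1))            # 1.0
```

Because the labels are only compared for equality, there is no need to encode categories as integers first.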

*In general I’m not a fan of pop art style illustrations, but I like these three examples. Left: By artist Chamnan Chongpaiboon. I rate it Excellent. Center: By artist Shreya Bhan. I rate it Good. Right: By artist Michael Eyal. I rate it as Good.*