ANOVA with R

I wrote an article in the May 2016 issue of Visual Studio Magazine titled “Classic Stats, Or What ANOVA with R Is All About”. See

In statistics, because there are hundreds of common techniques (and thousands of rare ones), one of the biggest challenges is knowing which technique to use. ANOVA, which stands for “analysis of variance”, is used when you want to infer whether the averages (means) of three or more groups of data are the same, in situations where you only have samples from the groups.


In my article I set up a hypothetical demo problem where there are three different English classes at a large high school. Each English class uses a different textbook and a different teaching strategy. After the classes finish, if you could, you’d test all the students in the three classes, but because the hypothetical test is expensive and time consuming, you can only give the test to a few students in each class.

In the demo, the sample test scores are analyzed, giving an “F statistic” which in turn gives a “p-value” of 0.004. Loosely, this is the probability of seeing sample differences at least this large if the true means of the source groups (all English students, not just the samples) were all the same. Here, the p-value is small (where small means less than 0.05) so you’d conclude that the means are not all the same. In other words, at least one of the class means is different, which suggests that textbook and teaching technique may have an effect.
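As a rough sketch of what this kind of analysis looks like in R, the code below runs the built-in aov() function on made-up scores. The score values, the class labels A, B, C, and the variable names are assumptions for illustration only; they are not the data from the article.

```r
# Hypothetical test scores for four sampled students in each class
# (made-up values, for illustration only)
scores = c(78, 82, 75, 80,      # class A
           85, 88, 90, 86,      # class B
           70, 72, 68, 74)      # class C
group = factor(rep(c("A", "B", "C"), each=4))

model = aov(scores ~ group)     # one-way ANOVA
tbl = summary(model)[[1]]       # the ANOVA table as a data frame

fstat = tbl[["F value"]][1]
pval = tbl[["Pr(>F)"]][1]
cat("F statistic =", fstat, "\n")
cat("p-value     =", pval, "\n")

if (pval < 0.05) {
  cat("Evidence that the class means are not all the same \n")
} else {
  cat("No strong evidence of a difference in class means \n")
}
```

The 0.05 cutoff is the same conventional threshold mentioned above; a different threshold could be used if the situation calls for it.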

Three important things. First, the term analysis of variance is somewhat confusing: behind the scenes, the technique uses mathematical variance to make an inference about means. Second, the results are just probabilities, not certainties. Third, ANOVA makes certain math assumptions (such as normally distributed data with equal variances in each group) that you can’t entirely check in practice.

Posted in Machine Learning, R Language

The 2016 Interop Conference Expo

I spoke at the 2016 Interop Conference. The event is billed as the largest computer IT and networking conference in the world. It is a big conference; I estimate there are about 5,000 total attendees.


A big part of Interop is the large Expo. There are about 200 companies represented, with everything from the big guys like Cisco, Dell, and IBM, to small startup companies. Products and services included just about everything you can imagine that’s related to computer networks.


There’s a lot of energy at the Expo and I love looking at the booths – it gives me a good feel for what’s new in the field, and often gives me useful ideas I can bring back to my work.


Based on my observations at Interop 2016, I conclude that 1.) the single biggest topic is security, 2.) there’s a lot of consolidation in the IT industry, but, 3.) there’s still room for small, niche companies to be successful.

Posted in Conferences

The Multi-Armed Bandit Problem

I wrote an article in the May 2016 issue of Microsoft MSDN Magazine, titled “The Multi-Armed Bandit Problem”. See

Imagine you’re in Las Vegas, standing in front of three slot machines. You have 20 tokens to use. Each time, you drop a token into any of the three machines, pull the handle, and are paid a random amount. The machines pay out differently, but you initially have no knowledge of what kind of payout schedules the machines follow. What strategies can you use to try to maximize your gain?

This is an example of what’s called the multi-armed bandit problem, so named because a slot machine is informally called a one-armed bandit. The problem is not as whimsical as it might first seem. There are many important real-life problems, such as drug clinical trials, that are similar to the slot machine example.


In my article I present a short but complete demo program written in C#. There are several algorithms that can be used on the multi-armed bandit problem. My demo uses the simplest reasonable algorithm, which is called explore-exploit.
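To make the explore-exploit idea concrete, here is a minimal sketch in R (the article's demo is in C#, and this is not that code). It assumes each machine pays 1 with a hidden probability and 0 otherwise, and that 9 of the 20 tokens are spent exploring; the probabilities, the split, and all names are hypothetical choices for illustration.

```r
set.seed(1)   # makes the simulated pulls repeatable

# Assumed setup: each machine pays 1 with some hidden probability, else 0
winProbs = c(0.3, 0.7, 0.5)     # hypothetical; unknown to the player
nMachines = length(winProbs)
nTokens = 20
nExplore = 9                    # tokens spent exploring (3 per machine)

pull = function(m) {
  if (runif(1) < winProbs[m]) 1 else 0
}

wins = rep(0, nMachines)        # payout observed per machine
pulls = rep(0, nMachines)       # times each machine was played

# Explore phase: play the machines in round-robin order
for (i in 1:nExplore) {
  m = ((i - 1) %% nMachines) + 1
  wins[m] = wins[m] + pull(m)
  pulls[m] = pulls[m] + 1
}

# Exploit phase: put all remaining tokens in the best-looking machine
best = which.max(wins / pulls)
for (i in 1:(nTokens - nExplore)) {
  wins[best] = wins[best] + pull(best)
  pulls[best] = pulls[best] + 1
}

cat("Pulls per machine :", pulls, "\n")
cat("Wins per machine  :", wins, "\n")
cat("Machine exploited :", best, "\n")
cat("Total payout      :", sum(wins), "\n")
```

With so few exploration pulls, the machine that looks best is not always the one with the highest true payout rate, which is the basic tension in the problem.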

The multi-armed bandit problem is an interesting combination of math, economics, and computer science.

Posted in Machine Learning

R Language Vectors vs. Arrays vs. Lists vs. Matrices vs. Data Frames

The R language has five basic data structures. In order from simplest to most complex: vectors, lists, matrices, arrays, data frames. Even though R has been around for decades, I see many questions on the Internet about the differences between these five structures. The confusion is due to R’s bizarre naming, which differs from that of most mainstream languages.


Briefly, and with some details left out:

A vector is what most other programming languages call an array: a collection of cells with a fixed size where all cells hold the same type (integers or characters or reals or whatever).

A list can hold items of different types and the list size can be increased on the fly. List contents can be accessed either by index (like mylist[[1]]) or by name (like mylist$age).

A matrix is a two-dimensional vector (fixed size, all cell types the same).

An array is a vector with one or more dimensions. So, an array with one dimension is (almost) the same as a vector. An array with two dimensions is (almost) the same as a matrix. An array with three or more dimensions is an n-dimensional array.

A data frame is called a table in most languages. All values in a given column have the same type, different columns can have different types, and the columns can have header names.

Example vector code:

v = c(1:3)  # an integer vector with [1 2 3]
cat(v, "\n\n")

v = vector(mode="integer", 4)  # [0 0 0 0]
cat(v, "\n\n")

v = c("a", "b", "x")
cat(v, "\n\n")

Example list code:

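mylist = list("a", 2.2)  # items of different types
mylist[[3]] = as.integer(3)  # grow the list on the fly

cat(mylist[[2]], "\n\n")  # access by index

mylist = list(name="Smith", age=22)
cat(mylist$name, ":", mylist$age)  # access by name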

Example matrix code:

m = matrix(0.0, nrow=2, ncol=3) # 2x3

Example array code:

arr = array(0.0, 3)  # [0.0 0.0 0.0]

arr = array(0.0, c(2,3))  # 2x3 matrix

arr = array(0.0, c(2,5,4)) # 2x5x4 n-array
# print(arr)  # 40 values displayed
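The “(almost) the same” wording comes down to the dim attribute, and a short check makes that visible. This is only a sketch; the variable names are arbitrary.

```r
v = c(0.0, 0.0, 0.0)          # plain vector
a1 = array(0.0, 3)            # one-dimensional array
a2 = array(0.0, c(2,3))       # two-dimensional array
m = matrix(0.0, nrow=2, ncol=3)

print(dim(v))                 # NULL, a plain vector has no dim attribute
print(dim(a1))                # 3
print(is.vector(a1))          # FALSE, because of the dim attribute
print(is.matrix(a2))          # TRUE
print(all(a2 == m))           # TRUE, same shape and values
```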

Example data frame code:

people = c("Alex", "Barb", "Carl") # col 1
ages = c(19, 29, 39)  # col 2
df = data.frame(people, ages)  # create
names(df) = c("NAME", "AGE")  # headers
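As a small follow-up sketch, the code below rebuilds the same data frame and inspects it. The stringsAsFactors=FALSE argument is an addition here so that the name column stays a character column regardless of R version.

```r
people = c("Alex", "Barb", "Carl")
ages = c(19, 29, 39)
df = data.frame(people, ages, stringsAsFactors=FALSE)
names(df) = c("NAME", "AGE")

cat(class(df$NAME), class(df$AGE), "\n")  # one type per column
cat(nrow(df), ncol(df), "\n")             # number of rows, columns
cat(df$AGE[2], "\n")                      # a cell, by column name and row
cat(df[3, "NAME"], "\n")                  # a cell, by row and column
```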

Posted in R Language

Large Company Workplace Motivation – Not

A friend of mine who works at a very large software company told me how a bunch of posters were suddenly put up all over his workplace. These posters were intended to be motivational and inspiring. They were everywhere, infesting blank walls and break areas. Here’s an example of one of the posters:


Well, you can guess what happened. This company has a lot of really smart employees and when they saw these posters that had pictures and slogans more suitable to a 6th grade classroom than to one of the largest, smartest companies in the world, the employees were not impressed.

My friend sent me a picture of a poster that appeared in his break room on April 1st:


So, this is all harmless and humorous but points out unintended consequences. Interestingly, the posters actually did improve morale, but not as intended — by bonding rank and file employees against the lameness of the posters.

Workplace motivation is a serious topic and it’s surprising how often management efforts to improve motivation and morale can backfire and have a demotivating effect. “Employee of the Month” awards are a perfect example of what not to do. But in this example, the motivation-improvement effort worked, quite by accident.

Posted in Miscellaneous

Custom Big Integer Libraries

Many computer science students are somewhat surprised when they learn that in most programming languages the largest value of the standard 32-bit integer type is only 2,147,483,647. Sure, 2 billion is pretty large, but when you’re dealing with combinations and permutations, integer values can be astronomically large.
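In R the limit is easy to see for yourself. The following short sketch uses the built-in .Machine variable; the variable names are arbitrary.

```r
big = .Machine$integer.max
cat(big, "\n")                         # 2147483647

# Going past the limit gives NA (R also issues a warning, silenced here)
overflow = suppressWarnings(big + 1L)
print(overflow)                        # NA

# The built-in factorial() avoids this by returning a double,
# which holds only about 15 to 16 significant digits exactly
print(factorial(25))
```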

The C# and Java languages have add-on libraries to do Big Integer math. The R language has one too called “gmp”. I decided on a whim to see how hard it’d be to code up a custom Big Integer library. Because I’m currently diving into R, I picked it for my language.


My R language Big Integer library is just a set of functions. A big integer object is just an R list of arbitrary length that contains digits like “6”, “3”, and so on, stored as regular integers. Inefficient but simple.

Coding the addition and subtraction functions wasn’t too hard. Multiplication was a bit of a challenge. Division was quite difficult. When implementing division, I found zero useful information on the Internet. Lots of theory and lots of floating point algorithms. Eventually I just simulated ordinary long division by hand.
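To give a feel for the digit-by-digit approach, here is a minimal sketch of addition. It is not the library code described above: it assumes non-negative numbers stored as plain numeric vectors of digits with the most significant digit first, and the function name bigAdd is hypothetical.

```r
# Add two non-negative big integers stored as digit vectors,
# most significant digit first, e.g. 905 is c(9, 0, 5)
bigAdd = function(a, b) {
  a = rev(a); b = rev(b)              # work from the ones digit upward
  n = max(length(a), length(b))
  a = c(a, rep(0, n - length(a)))     # pad the shorter number with zeros
  b = c(b, rep(0, n - length(b)))
  result = rep(0, n)
  carry = 0
  for (i in 1:n) {
    s = a[i] + b[i] + carry
    result[i] = s %% 10               # digit that stays in this column
    carry = s %/% 10                  # digit carried to the next column
  }
  if (carry > 0) result = c(result, carry)
  rev(result)                         # back to most significant first
}

x = c(9, 9, 9)
y = c(2, 7)
cat(bigAdd(x, y), sep="")   # 1026
cat("\n")
```

Subtraction follows the same column-by-column pattern with a borrow instead of a carry.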

Once I had my basic routines created, I used them to code up a factorial(n) function and a choose(n, k) function. All in all, it was interesting, good fun, and gave me many insights into the nuances of R language programming.
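A factorial can be sketched without full big-by-big multiplication, because each step only multiplies the running result by a small ordinary integer. The code below takes that shortcut, so it is a simplification rather than the library described above; the names bigTimesSmall and bigFactorial and the digit-vector layout (most significant digit first) are assumptions.

```r
# Multiply a digit vector (most significant digit first)
# by a small ordinary integer k
bigTimesSmall = function(digits, k) {
  d = rev(digits)
  carry = 0
  for (i in seq_along(d)) {
    p = d[i] * k + carry
    d[i] = p %% 10
    carry = p %/% 10
  }
  while (carry > 0) {                 # spill leftover carry into new digits
    d = c(d, carry %% 10)
    carry = carry %/% 10
  }
  rev(d)
}

bigFactorial = function(n) {
  result = c(1)
  if (n >= 2) {
    for (k in 2:n) result = bigTimesSmall(result, k)
  }
  result
}

cat(bigFactorial(25), sep="")   # 15511210043330985984000000
cat("\n")
```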

Posted in Machine Learning, R Language

R Language S3 Classes

The R programming language has several ways to write object oriented code, including list encapsulation, S3, S4, RC, and R6. In general, the best approach is to use the RC (“reference classes”) technique.


Somewhat surprisingly, there are very few examples of how to write S3 code that I consider good. By good I mean skipping unneeded chit-chat and showing example code.

So, here’s my yet-another-S3 example. I’ll do a Person class. My demo script starts:

# s3person.R
# 3.4.2

# S3 OOP
Person = function(ln="NONAME", a=0, ht=0) {
  this = list(
    lastName = ln,
    age = a,
    height = ht
  )
  class(this) = append(class(this), "Person")
  return(this)
}

Here I define a Person class where the object will have a lastName, an age, and a height. The “this” is a variable to reference the object and isn’t a reserved word so I could have used “me” or “self”. Notice that an S3 class is a function definition with a bit of extra plumbing using the “class” and “append” functions. (Note: I’m using ‘=’ instead of the preferred arrow operator so my annoying blog software doesn’t go crazy trying to interpret it as HTML).

The class doesn’t have any methods. Methods in S3 are optional but useful so next my demo script adds a display method:

display = function(obj) {
  UseMethod("display", obj)
}

display.Person = function(obj) {
  cat("Last name : ", obj$lastName, "\n")
  cat("Age       : ", obj$age, "\n")
  cat("Height    : ", obj$height, "\n")
}

Each method in S3 needs (at least) two functions. The first of the pair essentially registers the name (here, “display”) of the method. The second function contains the implementation code. Notice the wacky “methodName.className” pattern of display.Person(). That’s just the magic syntax to use.

Next the demo script adds a method to set the value of a Person lastName:

setLastName = function(obj, ln) {
  UseMethod("setLastName", obj)
}

setLastName.Person = function(obj, ln) {
  obj$lastName = ln
  return(obj)  # return the modified copy
}

The pattern should be clear now — register a method name, code up the method implementation.

OK, now my demo script tests the Person class. First:

cat("\nBegin S3 Person class demo \n")

cat("Initializing a default Person, p1 \n")
p1 = Person()  # default param values
cat("Person 'p1' is: \n")
display(p1)
cat("\n")

cat("Initializing a Person object p2 \n")
p2 = Person("Barrow", 29, 68.5)
cat("Person 'p2' is: \n")
display(p2)

The output of this part of code will be:

(prompt) source("s3person.R")

Begin S3 Person class demo 
Initializing a default Person, p1 
Person 'p1' is: 
Last name :  NONAME 
Age       :  0 
Height    :  0 

Initializing a Person object p2 
Person 'p2' is: 
Last name :  Barrow 
Age       :  29 
Height    :  68.5

You create an S3 object just like calling a regular R function. Next the demo concludes by showing how to set and get fields:

cat("Setting p1 directly and with a setter() \n")
p1 = setLastName(p1, "Ankers")
p1$age = 19
p1$height = 61.1
cat("Person 'p1' is now: \n")
display(p1)

cat("\nEnd demo \n")

Notice the calling pattern when changing an object is object = function(object, value(s)) because of R’s pass-by-value mechanism.
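The effect of pass-by-value is easy to demonstrate. The sketch below uses a cut-down Person with only a lastName field (an assumption to keep the example short) and calls the setter with and without assigning the result back.

```r
Person = function(ln="NONAME") {
  this = list(lastName = ln)
  class(this) = append(class(this), "Person")
  return(this)
}

setLastName = function(obj, ln) {
  UseMethod("setLastName", obj)
}

setLastName.Person = function(obj, ln) {
  obj$lastName = ln     # changes only the function's local copy
  return(obj)           # so the modified copy must be returned
}

p = Person("Ankers")
invisible(setLastName(p, "Barrow"))   # returned copy is thrown away
cat(p$lastName, "\n")                 # Ankers

p = setLastName(p, "Barrow")          # returned copy is assigned back
cat(p$lastName, "\n")                 # Barrow
```

If you need objects that can be changed in place, that is one of the things the RC and R6 approaches mentioned earlier are designed for.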

Posted in R Language