For many machine learning problems, a common first step is to read data from a text file into a matrix, and then from that matrix create a training matrix (with typically a random 70 or 80 percent of the data items) and a test matrix (with the remaining items). There are quite a few ways to approach this problem. The code below shows a technique I use that balances efficiency, clarity, and side effects.

Suppose the raw data in a text file corresponds to the famous Iris data set and looks like:

5.1,3.5,1.4,0.2,0,0,1
7.0,3.2,4.7,1.4,0,1,0
6.3,3.3,6.0,2.5,1,0,0
. . .

The first four values in each line are the predictors (sepal length, sepal width, petal length, petal width) and the last three values encode the species to predict, where (0,0,1) is setosa, (0,1,0) is versicolor, and (1,0,0) is virginica.

Let’s assume that these values have been stored in an array-of-arrays style matrix named data[][], by means of some method LoadData that reads the source file, parses each line, and stores each value.
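The article does not show LoadData itself. As a rough sketch of what such a loader might look like, here is a Java version (the article's code is C#; the class name, the in-memory string array standing in for a file, and the use of a List are my assumptions, not the original code):

```java
import java.util.ArrayList;
import java.util.List;

public class LoadDataDemo {
  // Parse comma-separated lines like "5.1,3.5,1.4,0.2,0,0,1" into a
  // jagged double[][] matrix. File reading is replaced by an in-memory
  // array of lines to keep the sketch self-contained.
  static double[][] loadData(String[] lines) {
    List<double[]> rows = new ArrayList<>();
    for (String line : lines) {
      String[] tokens = line.split(",");
      double[] row = new double[tokens.length];
      for (int j = 0; j < tokens.length; ++j)
        row[j] = Double.parseDouble(tokens[j].trim());
      rows.add(row);
    }
    return rows.toArray(new double[0][]);
  }

  public static void main(String[] args) {
    String[] lines = {
      "5.1,3.5,1.4,0.2,0,0,1",
      "7.0,3.2,4.7,1.4,0,1,0"
    };
    double[][] data = loadData(lines);
    System.out.println(data.length);  // 2
    System.out.println(data[0][0]);   // 5.1
  }
}
```

In a real loader you would read the lines with a BufferedReader or similar, but the parsing logic stays the same.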

We start like so:

static void MakeTrainTestByRef(double[][] allData, int seed,
  out double[][] trainData, out double[][] testData)
{
  Random rnd = new Random(seed);
  int totRows = allData.Length;
  . . .

The method accepts the source matrix and a seed value for the randomization process (because we want the data items to be distributed randomly). The results are out-parameters. I don’t like using out-parameters, but for this problem they make sense. The train-test split is hard-coded as 80%-20%, but you could add two parameters along the lines of trainPct and testPct, or just trainPct, because testPct is determined by trainPct (the two must sum to 100%).

The method begins by creating a Random object and storing the number of rows into a local variable for easier readability. Next:

int numTrainRows = (int)(totRows * 0.80);
int numTestRows = totRows - numTrainRows;
trainData = new double[numTrainRows][];
testData = new double[numTestRows][];

The numbers of rows for the training and test matrices are computed, and the result matrices are allocated. Working with matrices can be tricky, especially if, like many developers, you don’t work with them frequently. Next, a copy of all data is made, by reference:

double[][] copy = new double[allData.Length][];
for (int i = 0; i < copy.Length; ++i)
  copy[i] = allData[i];

We make a copy so that the original data matrix will not be affected by the row-scrambling. We make the copy by reference (each cell of copy points to an existing row of allData, so no cell values are duplicated) because the data matrix might be huge. Next, the rows of the copy matrix are scrambled using the Fisher-Yates algorithm:

for (int i = 0; i < copy.Length; ++i)
{
  int r = rnd.Next(i, copy.Length); // random index in [i, copy.Length)
  double[] tmp = copy[r];
  copy[r] = copy[i];
  copy[i] = tmp;
}
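A useful property of the Fisher-Yates scramble is that the result is an exact permutation of the original rows: nothing is lost or duplicated. Here is the same loop in Java, with a small check of that property (the class and method names are mine; Java's Random.nextInt(bound) takes only an exclusive upper bound, so the C# call rnd.Next(i, n) becomes i + rnd.nextInt(n - i)):

```java
import java.util.Arrays;
import java.util.Random;

public class ShuffleDemo {
  // Fisher-Yates: at step i, swap row i with a randomly chosen
  // row from the range [i, m.length).
  static void shuffleRows(double[][] m, int seed) {
    Random rnd = new Random(seed);
    for (int i = 0; i < m.length; ++i) {
      int r = i + rnd.nextInt(m.length - i); // random index in [i, m.length)
      double[] tmp = m[r];
      m[r] = m[i];
      m[i] = tmp;
    }
  }

  public static void main(String[] args) {
    double[][] m = { {0}, {1}, {2}, {3}, {4} };
    shuffleRows(m, 0);
    // Sorting the first column shows every original row is still
    // present exactly once, whatever order the shuffle produced.
    double[] flat = new double[m.length];
    for (int i = 0; i < m.length; ++i) flat[i] = m[i][0];
    Arrays.sort(flat);
    System.out.println(Arrays.toString(flat)); // [0.0, 1.0, 2.0, 3.0, 4.0]
  }
}
```

Because the shuffle swaps row references, each swap is cheap no matter how many columns a row has.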

The method finishes by assigning, by reference, the train and test matrices:

  . . .
  for (int i = 0; i < numTrainRows; ++i)
    trainData[i] = copy[i];
  for (int i = 0; i < numTestRows; ++i)
    testData[i] = copy[i + numTrainRows];
} // MakeTrainTestByRef

In machine learning situations where the entire source data can fit into memory, I’ve used this technique often and it meets my needs most of the time.
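For readers who want to see the pieces assembled, here is a complete Java translation of the technique (the article's code is C#; Java has no out-parameters, so this sketch returns the two matrices as a two-element result array, and the class and method names are my own):

```java
import java.util.Random;

public class SplitDemo {
  // 80%-20% train-test split by reference, following the same steps as
  // the article's MakeTrainTestByRef. Java lacks out-parameters, so the
  // result is returned: result[0] is the training matrix, result[1] the
  // test matrix.
  static double[][][] makeTrainTest(double[][] allData, int seed) {
    Random rnd = new Random(seed);
    int totRows = allData.length;
    int numTrainRows = (int)(totRows * 0.80);
    int numTestRows = totRows - numTrainRows;
    double[][] trainData = new double[numTrainRows][];
    double[][] testData = new double[numTestRows][];

    // Shallow copy: each cell points to an existing row of allData.
    double[][] copy = new double[totRows][];
    for (int i = 0; i < copy.length; ++i)
      copy[i] = allData[i];

    // Fisher-Yates scramble of the row references.
    for (int i = 0; i < copy.length; ++i) {
      int r = i + rnd.nextInt(copy.length - i);
      double[] tmp = copy[r];
      copy[r] = copy[i];
      copy[i] = tmp;
    }

    // Assign, by reference, the train and test matrices.
    for (int i = 0; i < numTrainRows; ++i)
      trainData[i] = copy[i];
    for (int i = 0; i < numTestRows; ++i)
      testData[i] = copy[i + numTrainRows];
    return new double[][][] { trainData, testData };
  }

  public static void main(String[] args) {
    double[][] data = new double[10][];
    for (int i = 0; i < 10; ++i)
      data[i] = new double[] { i, i * 0.5 };
    double[][][] split = makeTrainTest(data, 0);
    System.out.println(split[0].length); // 8
    System.out.println(split[1].length); // 2
  }
}
```

The original source matrix is left in its original row order; only the shared row objects are referenced from the two result matrices.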

I couldn’t comment on the topic “Precision, Recall, Type I Error, Type II Error, True Positive and False Positive, and ROC Curves”, so I’m doing it here. The example says “Unknown to you, 74 of those people are in fact U.S. citizens and 16 are not U.S. citizens”, so 74 + 16 = 90, but the total should be 100.

Thank you. You are right. The 74 should have been 84. I corrected the post. Interestingly, that mistake did not change the calculation of either precision or recall.