Last week I was working on a project that used data files, and I realized that testing the validity of data files is a topic that isn’t discussed much. The project was graph partitioning: dividing a graph into two parts in such a way that the number of connections between the two parts (“cuts”) is minimized. This turns out to be a difficult problem, so there are libraries of test data files of graphs that you can use as input to your graph partitioning code. Similar libraries of data files exist for the traveling salesman problem, the data clustering problem, and many others.

Each library of data files has a specific format that describes exactly how the data is represented. For example, in the files I was looking at, lines that begin with the ‘#’ character are comments, and the first non-comment line in the file must have two numbers separated by a blank space, where the first number is the count of nodes in the graph and the second number is the count of connections in the graph.

Anyway, before I ran my graph partitioning code, I wanted to validate that the target data file followed all the rules I was expecting. This was a fairly time-consuming and tedious process, and one that really isn’t feasible to do by hand. I don’t have a moral, except that maybe testing the validity of data files is a common testing task that seems to have few general guiding principles. Maybe this is why I don’t read about testing data files very often.
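To make the header rule described above concrete, here is a minimal Python sketch of a check for just that one rule: skip lines that begin with ‘#’, then require the first non-comment line to hold two numbers separated by a single blank space. The names `read_header` and `HEADER_PATTERN` are made up for illustration. Treating the numbers as non-negative integers, and rejecting blank lines that appear before the header, are assumptions, since the format description doesn’t say either way.

```python
import re

# Assumption: both counts are non-negative integers separated by exactly one space.
HEADER_PATTERN = re.compile(r"([0-9]+) ([0-9]+)")


def read_header(lines):
    """Return (node_count, connection_count) from the first non-comment line.

    `lines` is any iterable of text lines, such as an open file.
    Raises ValueError if the header is missing or malformed.
    """
    for line_number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if line.startswith("#"):
            continue  # comment line, skip it
        match = HEADER_PATTERN.fullmatch(line)
        if match is None:
            # Assumption: anything else here, including a blank line, is an error.
            raise ValueError(
                f"line {line_number}: expected '<nodes> <connections>', got {line!r}"
            )
        return int(match.group(1)), int(match.group(2))
    raise ValueError("no non-comment line found")
```

For example, `read_header(["# a comment\n", "4 5\n"])` returns `(4, 5)`, while a header such as `"4  5"` (two spaces) or `"4 x"` raises a `ValueError` that names the offending line. A real validator would need further checks for whatever rules the rest of the file format imposes.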