While I was walking my dogs one weekend, I was thinking about the PyTorch Dataset object. A Dataset object is part of the somewhat complicated system needed to fetch data and serve it up in batches when training a PyTorch neural network.
A Dataset is really an interface that must be implemented. When you implement a Dataset, you must write code to read data from a text file and convert the data to PyTorch tensors. I noticed that all the PyTorch documentation examples read data into memory using the read_csv() function from the Pandas library. I had always used the loadtxt() function from the NumPy library. I decided I’d implement a Dataset using both techniques to determine if the read_csv() approach has some special advantage.
Conclusion: Using read_csv() to read data for a PyTorch Dataset has no advantage over using the loadtxt() function.
If you use the loadtxt() function, the result is a two-dimensional NumPy array (a matrix), which can be fed directly to a tensor constructor. But if you use the read_csv() function, the result is a DataFrame, which must be converted to a NumPy array before feeding to a tensor constructor.
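To make the difference concrete, here's a minimal sketch that reads the same data both ways. The three rows of tab-delimited data are made up for illustration, and I'm assuming column 0 is an ID and columns 1 to 4 are predictors; io.StringIO stands in for a real file.

```python
import io
import numpy as np
import pandas as pd

# Hypothetical tab-delimited data: an ID column, then four predictor columns.
raw = ("1\t0.1\t0.2\t0.3\t0.4\n"
       "2\t0.5\t0.6\t0.7\t0.8\n"
       "3\t0.9\t1.0\t1.1\t1.2\n")

# loadtxt() returns a NumPy array directly.
x_np = np.loadtxt(io.StringIO(raw), usecols=range(1, 5),
                  delimiter="\t", dtype=np.float32)

# read_csv() returns a DataFrame, which needs one extra conversion step.
x_frame = pd.read_csv(io.StringIO(raw), sep="\t", header=None,
                      usecols=range(1, 5), dtype=np.float32)
x_pd = x_frame.to_numpy()

# Both routes should end at the same float32 matrix.
same = bool(np.allclose(x_np, x_pd))
print(type(x_np).__name__, type(x_frame).__name__, same)
```

The only real difference is the extra conversion step on the Pandas side; here I used to_numpy(), but wrapping the frame in np.array() works too.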
Many Python developers seem to have an exaggerated fondness for Pandas. Pandas is very flexible and very useful in some scenarios. But for reading data for use in a Dataset object, the NumPy loadtxt() function is simpler than using the Pandas read_csv() function.
Here’s a snippet of the loadtxt() version:
x_data = np.loadtxt(src_file, max_rows=num_rows,
  usecols=range(1,5), delimiter="\t", skiprows=0,
  dtype=np.float32)
self.x_data = T.tensor(x_data, dtype=T.float32).to(device)
And here’s a snippet of the read_csv() version:
x_frame = pd.read_csv(src_file, sep="\t", header=None,
  usecols=range(1,5), dtype=np.float32, nrows=num_rows)
x_data = np.array(x_frame.iloc[:,:])  # all rows, all cols
self.x_data = T.tensor(x_data, dtype=T.float32).to(device)
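Both snippets live inside the __init__() method of a Dataset class. For context, here's a rough sketch of what a complete loadtxt() version might look like. This is not my exact demo code: the class name TabDataset, the assumption that column 0 holds a numeric target value, and the three lines of demo data are all made up for illustration.

```python
import io
import numpy as np
import torch as T

class TabDataset(T.utils.data.Dataset):
    # Assumed layout (hypothetical): column 0 is a numeric target,
    # columns 1-4 are the predictors.
    def __init__(self, src_file, num_rows=None, device="cpu"):
        all_data = np.loadtxt(src_file, max_rows=num_rows,
                              usecols=range(0, 5), delimiter="\t",
                              dtype=np.float32, ndmin=2)
        self.x_data = T.tensor(all_data[:, 1:5],
                               dtype=T.float32).to(device)
        self.y_data = T.tensor(all_data[:, 0],
                               dtype=T.float32).reshape(-1, 1).to(device)

    def __len__(self):
        return len(self.x_data)

    def __getitem__(self, idx):
        # Return one (predictors, target) pair.
        return self.x_data[idx], self.y_data[idx]

# Made-up data standing in for a real tab-delimited file.
demo = io.StringIO("1\t0.1\t0.2\t0.3\t0.4\n"
                   "0\t0.5\t0.6\t0.7\t0.8\n"
                   "1\t0.9\t1.0\t1.1\t1.2\n")
ds = TabDataset(demo)

# A DataLoader serves the Dataset items up in batches.
loader = T.utils.data.DataLoader(ds, batch_size=2, shuffle=False)
batches = list(loader)
for x_batch, y_batch in batches:
    print(tuple(x_batch.shape), tuple(y_batch.shape))
```

Swapping in the read_csv() approach would change only the first statement in __init__(), plus the conversion from DataFrame to NumPy array.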
There’s a non-technical moral to this story. I spent several hours on my investigation on a Saturday evening because I wanted to, not because I had to. All the good computer scientists I know — and I know some really good ones — are the same way. We study computer science relentlessly — before, during, and after work — because we are passionate about it, and truly enjoy it.
This is what people who are advocating for greater participation in STEM careers by underrepresented groups don’t get: if a corporate or government program tries to convince “underrepresented” young people to go into computer science or math only because the jobs pay well or because of some misguided notion of social justice, these young people are being done a huge disservice. Not everyone can be a computer scientist, or an opera singer, or an expert salesman, or a biochemist. People who are steered into a job where they don’t have the intellectual ability and passion to succeed will likely be very unhappy.
The American education system is set up so that anyone who has a.) the intellectual ability for, and b.) a deep passion for, computer science can succeed without any political encouragement. A Python interpreter does not care who wrote the code. My philosophy is: Let all people make up their own minds about where their abilities and passions lie.
Passion is a good thing, but in movies passion doesn’t always end well. Left: “Romeo and Juliet” (the 1968 version). Center: “Mutiny on the Bounty” (the 1984 version). Right: “Troy” (the 2004 version). Albert Einstein said, “Love is a better teacher than duty.”