PyTorch Dataset: Reading Data Using Pandas vs. NumPy

While I was walking my dogs one weekend, I was thinking about the PyTorch Dataset object. A Dataset object is part of the somewhat complicated system needed to fetch data and serve it up in batches when training a PyTorch neural network.

A Dataset is really an interface that must be implemented. When you implement a Dataset, you must write code to read data from a text file and convert the data to PyTorch tensors. I noticed that all the PyTorch documentation examples read data into memory using the read_csv() function from the Pandas library. I had always used the loadtxt() function from the NumPy library. I decided I’d implement a Dataset using both techniques to determine if the read_csv() approach has some special advantage.
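To make the "interface that must be implemented" idea concrete, here is a minimal sketch of a Dataset built on loadtxt(). The class name TabularDataset, the in-memory demo data, and the column layout (column 0 is an ID, columns 1 to 4 are predictors) are assumptions for illustration, not the code from my experiment.

```python
import io
import numpy as np
import torch as T
from torch.utils.data import Dataset, DataLoader

class TabularDataset(Dataset):
    # Hypothetical layout: column 0 is an ID, columns 1-4 are predictors.
    def __init__(self, src, num_rows=None):
        # ndmin=2 keeps a 2-D result even if the source has a single row
        x_data = np.loadtxt(src, max_rows=num_rows, usecols=range(1, 5),
                            delimiter="\t", dtype=np.float32, ndmin=2)
        self.x_data = T.tensor(x_data, dtype=T.float32)

    def __len__(self):
        # number of rows (items) in the dataset
        return len(self.x_data)

    def __getitem__(self, idx):
        # one row of predictors as a float32 tensor
        return self.x_data[idx]

# Made-up tab-delimited data standing in for a real text file
demo = io.StringIO(
    "0\t0.1\t0.2\t0.3\t0.4\n"
    "1\t0.5\t0.6\t0.7\t0.8\n"
    "2\t0.9\t1.0\t1.1\t1.2\n"
)
ds = TabularDataset(demo)
loader = DataLoader(ds, batch_size=2, shuffle=False)
batches = [b for b in loader]  # a batch of 2 rows, then a batch of 1 row
```

The two methods a map-style Dataset needs are __len__() and __getitem__(); a DataLoader then handles the batching.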

Conclusion: Using read_csv() to read data for a PyTorch Dataset has no advantage over using the loadtxt() function.

If you use the loadtxt() function, the result is a two-dimensional NumPy array (a matrix), which can be fed directly to a tensor constructor. But if you use the read_csv() function, the result is a DataFrame, which must be converted to a NumPy array before feeding to a tensor constructor.

Many Python developers seem to have an exaggerated fondness for Pandas. Pandas is very flexible and very useful in some scenarios. But for reading data for use in a Dataset object, the NumPy loadtxt() function is simpler than using the Pandas read_csv() function.

Here’s a snippet of the loadtxt() version:

x_data = np.loadtxt(src_file, max_rows=num_rows,
  usecols=range(1,5), delimiter="\t", skiprows=0)

self.x_data = T.tensor(x_data, dtype=T.float32).to(device)

And here’s a snippet of the read_csv() version:

x_frame = pd.read_csv(src_file, sep="\t", header=None,
  usecols=range(1,5), dtype=np.float32, nrows=num_rows)

x_data = np.array(x_frame.iloc[:,:])  # all rows, all cols

self.x_data = T.tensor(x_data, dtype=T.float32).to(device)
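As a quick sanity check, the sketch below reads the same tab-delimited text both ways and compares the results. The in-memory sample data and the variable names are made up for illustration, and to_numpy() is used here as an alternative to wrapping the DataFrame in np.array().

```python
import io
import numpy as np
import pandas as pd

# Made-up data: column 0 is an ID, columns 1-4 are the values to load
text = "0\t0.1\t0.2\t0.3\t0.4\n1\t0.5\t0.6\t0.7\t0.8\n"

# NumPy route: loadtxt() returns an ndarray directly
via_numpy = np.loadtxt(io.StringIO(text), usecols=range(1, 5),
                       delimiter="\t", dtype=np.float32)

# Pandas route: read_csv() returns a DataFrame that needs one more conversion
frame = pd.read_csv(io.StringIO(text), sep="\t", header=None,
                    usecols=range(1, 5), dtype=np.float32)
via_pandas = frame.to_numpy()

# Both routes should give an array of the same shape with matching values
same = (via_numpy.shape == via_pandas.shape
        and np.allclose(via_numpy, via_pandas))
print(same)
```

Either array can then be passed to T.tensor() in exactly the same way, which is why the extra DataFrame step buys nothing here.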

There’s a non-technical moral to this story. I spent several hours on my investigation on a Saturday evening because I wanted to, not because I had to. All the good computer scientists I know — and I know some really good ones — are the same way. We study computer science relentlessly — before, during, and after work — because we are passionate about it, and truly enjoy it.

This is what people who are advocating for greater participation in STEM careers by underrepresented groups don’t get: if a corporate or government program tries to convince “underrepresented” young people to go into computer science or math only because the jobs pay well or because of some misguided notion of social justice, these young people are being done a huge disservice. Not everyone can be a computer scientist, or an opera singer, or an expert salesman, or a biochemist. People who are steered into a job where they don’t have the intellectual ability and passion to succeed will likely be very unhappy.

The American education system is set up so that anyone who has a.) the intellectual ability for, and b.) a deep passion for, computer science can succeed without any political encouragement. A Python language compiler does not care who wrote the code. My philosophy is: Let all people make up their own minds where their abilities and passion are.

Passion is a good thing, but in movies passion doesn’t always end well. Left: “Romeo and Juliet” (the 1968 version). Center: “Mutiny on the Bounty” (the 1984 version). Right: “Troy” (the 2004 version). Albert Einstein said, “Love is a better teacher than duty.”

3 Responses to PyTorch Dataset: Reading Data Using Pandas vs. NumPy

  1. Thorsten Kleppe says:

    Thanks for the daily bam! Men like you were the reason I lost my fear of this STEM monster.
    In Germany we call it MINT (Mathematik, Informatik, Naturwissenschaft und Technik).

    It’s nice to have you James. 🙂

  2. Peter Boos says:

    The Pandas module is used for working with tabular data. It allows us to work with data in table form, such as in CSV or SQL database formats. We can also create tables of our own, and edit or add columns or rows to tables. Pandas provide us with some powerful objects like DataFrames and Series which are very useful for working with and analyzing data.

    While the Numpy module is mainly used for working with numerical data. It provides us with a powerful object known as an Array. With Arrays, we can perform mathematical operations on multiple values in the Arrays at the same time, and also perform operations between different Arrays, similar to matrix operations.

    Also keep in mind the quick views Pandas gives you in Jupyter, so that’s probably the reason people prefer Pandas.
