To train a PyTorch neural network, the most common approach is to read training data into a Dataset object, and then use a DataLoader object to serve the training data up in batches. When I implement a Dataset, I almost always use the NumPy loadtxt() function to read training data from file into memory. But it’s possible to use the Pandas read_csv() function instead. Bottom line: the Pandas approach isn’t especially useful because the Pandas data frame has to be converted to a NumPy matrix anyway.
I used one of my standard examples to code up a demo of NumPy loadtxt() vs Pandas read_csv() functions. The goal is to predict political leaning (conservative = 0, moderate = 1, liberal = 2) from sex, age, state of residence, and income. The data looks like:
```
 1  0.24  1  0  0  0.2950  2
-1  0.39  0  0  1  0.5120  1
 1  0.63  0  1  0  0.7580  0
-1  0.36  1  0  0  0.4450  1
 1  0.27  0  1  0  0.2860  2
. . .
```
The columns are sex (M = -1, F = +1), age divided by 100, state (Michigan = 100, Nebraska = 010, Oklahoma = 001), income divided by $100,000, and political leaning. The data is synthetic.
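To make the encoding concrete, here is a minimal sketch that turns one raw record into the normalized layout described above. The helper name `encode_person` and the two lookup tables are hypothetical, introduced only for illustration; they are not part of the original demo program.

```python
# Hypothetical lookup tables that mirror the encoding described in the text.
STATE_CODES = {
    "michigan": (1, 0, 0),
    "nebraska": (0, 1, 0),
    "oklahoma": (0, 0, 1),
}
LEANING_CODES = {"conservative": 0, "moderate": 1, "liberal": 2}


def encode_person(sex, age, state, income, leaning):
    """Encode one raw record as [sex, age, state x3, income, leaning]."""
    sex_code = -1 if sex == "M" else 1      # M = -1, F = +1
    return [
        sex_code,
        age / 100.0,                        # age divided by 100
        *STATE_CODES[state.lower()],        # one-hot state
        income / 100_000.0,                 # income divided by $100,000
        LEANING_CODES[leaning.lower()],     # class label 0, 1, or 2
    ]


# A 24-year-old woman from Michigan earning $29,500 who leans liberal.
row = encode_person("F", 24, "Michigan", 29_500, "liberal")
print(row)
```

Under these assumptions the result corresponds to the first line of the sample data.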
A standard NumPy loadtxt() version of a Dataset is:
```python
import numpy as np
import pandas as pd  # not used this version
import torch as T

device = T.device("cpu")  # assumed here; defined elsewhere in the full program

class PeopleDataset(T.utils.data.Dataset):
  def __init__(self, src_file):
    # numpy loadtxt() version
    all_xy = np.loadtxt(src_file, usecols=range(0,7),
      delimiter="\t", comments="#", dtype=np.float32)

    tmp_x = all_xy[:,0:6]   # cols [0,6) = [0,5]
    tmp_y = all_xy[:,6]     # 1-D

    self.x_data = T.tensor(tmp_x,
      dtype=T.float32).to(device)
    self.y_data = T.tensor(tmp_y,
      dtype=T.int64).to(device)  # 1-D

  def __len__(self):
    return len(self.x_data)

  def __getitem__(self, idx):
    preds = self.x_data[idx]
    trgts = self.y_data[idx]
    return preds, trgts  # as a Tuple
```
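As a rough sketch of how a Dataset like this is served up in batches, the example below wraps a tiny in-memory stand-in in a DataLoader. The class `TinyPeopleDataset` is hypothetical and hard-codes the five sample rows so the snippet runs without a data file; the batch size of 2 is an arbitrary choice for illustration.

```python
import torch as T


class TinyPeopleDataset(T.utils.data.Dataset):
    """Hypothetical in-memory stand-in for PeopleDataset (no file needed)."""

    def __init__(self):
        all_xy = T.tensor([
            [ 1, 0.24, 1, 0, 0, 0.2950, 2],
            [-1, 0.39, 0, 0, 1, 0.5120, 1],
            [ 1, 0.63, 0, 1, 0, 0.7580, 0],
            [-1, 0.36, 1, 0, 0, 0.4450, 1],
            [ 1, 0.27, 0, 1, 0, 0.2860, 2],
        ], dtype=T.float32)
        self.x_data = all_xy[:, 0:6]             # six predictor columns
        self.y_data = all_xy[:, 6].to(T.int64)   # class labels, 1-D

    def __len__(self):
        return len(self.x_data)

    def __getitem__(self, idx):
        return self.x_data[idx], self.y_data[idx]


ds = TinyPeopleDataset()
loader = T.utils.data.DataLoader(ds, batch_size=2, shuffle=False)

batch_shapes = []
for preds, trgts in loader:
    # The default collate function stacks the individual items into batch tensors.
    batch_shapes.append((tuple(preds.shape), tuple(trgts.shape)))
    print(preds.shape, trgts.shape, trgts.dtype)
```

With five items and a batch size of 2, the last batch holds a single item unless `drop_last=True` is passed to the DataLoader.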
A version using the Pandas read_csv() function and the to_numpy() method is:
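```python
class PeopleDataset(T.utils.data.Dataset):
  def __init__(self, src_file):
    # pandas version
    # header=None: the data file has no header row, so without it
    # read_csv() would consume the first data line as column names
    xy_frame = pd.read_csv(src_file, header=None,
      usecols=range(0,7), delimiter="\t", comment="#",
      dtype=np.float32)
    all_xy = xy_frame.to_numpy()

    # as above
    . . .
```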
Instead of calling the Pandas to_numpy() method, it’s possible to select the data frame contents with the iloc indexer and pass the result to np.array():
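```python
class PeopleDataset(T.utils.data.Dataset):
  def __init__(self, src_file):
    # pandas version
    # header=None: the data file has no header row, so without it
    # read_csv() would consume the first data line as column names
    xy_frame = pd.read_csv(src_file, header=None,
      usecols=range(0,7), delimiter="\t", comment="#",
      dtype=np.float32)
    all_xy = np.array(xy_frame.iloc[:,:])

    # as above
    . . .
```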
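A quick way to check that the three approaches are interchangeable is to load the same text with each one and compare the results. The sketch below does that for the five sample rows, using an in-memory `io.StringIO` buffer in place of the real data file. The leading comment line in `raw` is an assumption about the file layout. The sketch also assumes the file has no header row, so `header=None` is passed to read_csv(); the last read shows what happens with the default header inference.

```python
import io

import numpy as np
import pandas as pd

# Five sample rows, tab-delimited, preceded by an assumed "#" comment line.
raw = (
    "# sex age state(3) income politics\n"
    "1\t0.24\t1\t0\t0\t0.2950\t2\n"
    "-1\t0.39\t0\t0\t1\t0.5120\t1\n"
    "1\t0.63\t0\t1\t0\t0.7580\t0\n"
    "-1\t0.36\t1\t0\t0\t0.4450\t1\n"
    "1\t0.27\t0\t1\t0\t0.2860\t2\n"
)

# NumPy loadtxt() version.
via_loadtxt = np.loadtxt(io.StringIO(raw), usecols=range(0, 7),
                         delimiter="\t", comments="#", dtype=np.float32)

# Pandas read_csv() version, converted two different ways.
frame = pd.read_csv(io.StringIO(raw), header=None, usecols=range(0, 7),
                    delimiter="\t", comment="#", dtype=np.float32)
via_to_numpy = frame.to_numpy()
via_iloc = np.array(frame.iloc[:, :])

print(via_loadtxt.shape, via_to_numpy.shape, via_iloc.shape)
print(np.allclose(via_loadtxt, via_to_numpy))
print(np.allclose(via_loadtxt, via_iloc))

# With the default header inference, the first data line becomes the
# column names, so one training item is silently lost.
inferred = pd.read_csv(io.StringIO(raw), delimiter="\t", comment="#")
print(len(frame), len(inferred))
```

All three arrays should have the same shape, dtype, and values, which is consistent with the point that the Pandas route ends up at a NumPy matrix anyway.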
The rest of the program and the training and test data can be found at: https://jamesmccaffrey.wordpress.com/2022/09/01/multi-class-classification-using-pytorch-1-12-1-on-windows-10-11/.
There’s no big moral to this story — just some fun mental exercise to stay in practice with PyTorch.
Two wonderful illustrations tagged as “amazingsurf” from fractal.batjorge.com. I don’t know the artist, but I’ll bet he does artistic exercises to stay in practice.