Hyperparameter Tuning Using Distributed Grid Descent

I was chit-chatting with one of my work colleagues about an algorithm he created for hyperparameter tuning. The algorithm is called Distributed Grid Descent (DGD).

Every neural prediction system has hyperparameters such as the training learning rate, the batch size, the number of hidden nodes in the architecture, the hidden activation function, and so on. Complex systems can have 10-20 hyperparameters.

In regular grid search, you set up candidate values, for example, lrn_rate = [0.001, 0.005, 0.010, 0.015, 0.020, 0.025] and bat_size = [4, 8, 10, 16, 32, 64] and max_epochs = [100, 500, 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000], and then try all possible combinations. The seed value for the program's random number generator is sort of a hyperparameter too: you should try each set of hyperparameters with typically about 10 different seed values and then average the results (often mean squared error).
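To make the idea concrete, here is a minimal sketch of regular grid search with seed averaging, using the candidate values above. The train_and_evaluate() function is a hypothetical stand-in I made up for illustration: it returns a synthetic error value instead of training a real network, so swap in your own training code.

```python
import itertools
import random
import statistics

lrn_rates = [0.001, 0.005, 0.010, 0.015, 0.020, 0.025]
bat_sizes = [4, 8, 10, 16, 32, 64]
max_epochs_list = [100, 500, 1000, 2000, 3000, 4000,
                   5000, 6000, 7000, 8000, 9000]
seeds = range(10)  # about 10 seed values per hyperparameter set

def train_and_evaluate(lrn_rate, bat_size, max_epochs, seed):
    # Hypothetical placeholder: a synthetic "mean squared error"
    # plus seed-dependent noise. Not a real training run.
    rng = random.Random(seed)
    base = ((lrn_rate - 0.010) ** 2) * 1.0e4 \
        + abs(bat_size - 16) / 64 + 100.0 / max_epochs
    return base + rng.uniform(0.0, 0.05)

def grid_search():
    results = {}
    # try every combination of candidate values
    for combo in itertools.product(lrn_rates, bat_sizes, max_epochs_list):
        errors = [train_and_evaluate(*combo, seed) for seed in seeds]
        results[combo] = statistics.mean(errors)  # average over seeds
    best = min(results, key=results.get)
    return best, results

best, results = grid_search()
print("combinations tried:", len(results))
print("best (lrn_rate, bat_size, max_epochs):", best)
```

With these candidate lists there are 6 * 6 * 11 = 396 combinations, so 10 seeds each means 3,960 training runs.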

The DGD algorithm.

Hyperparameter tuning with candidate values is a type of combinatorial optimization problem.

It’s not always possible to try every combination of hyperparameter values in the grid, because some systems have too many combinations and some systems have very long training times.
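A quick back-of-the-envelope calculation shows how fast the cost grows. This is only a sketch: the grid_cost() helper, the 5 minutes per training run, and the 10-hyperparameter system with 6 candidate values each are all assumptions I picked for illustration.

```python
import math

def grid_cost(candidate_counts, n_seeds, minutes_per_run):
    # number of grid combinations is the product of the list lengths
    n_combos = math.prod(candidate_counts)
    n_runs = n_combos * n_seeds
    hours = n_runs * minutes_per_run / 60.0
    return n_combos, n_runs, hours

# the three-hyperparameter example: 6, 6 and 11 candidate values
small = grid_cost([6, 6, 11], n_seeds=10, minutes_per_run=5)
print(small)

# assumed larger system: 10 hyperparameters, 6 candidate values each
large = grid_cost([6] * 10, n_seeds=10, minutes_per_run=5)
print(large)
```

The small grid is 396 combinations and 3,960 runs. The assumed larger grid is 6^10 = 60,466,176 combinations before you even multiply by the number of seeds.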

My colleague’s DGD algorithm searches through the hyperparameter values instead of trying every combination. The DGD algorithm is a form of evolutionary optimization that uses only mutation to generate each new candidate set of hyperparameter values. Some evolutionary optimization algorithms use crossover in addition to mutation.
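To show what mutation-only search over a grid can look like, here is a minimal generic sketch. It is not the DGD algorithm from the paper: it's a simple single-process hill climber where a mutation moves one hyperparameter to a neighboring candidate value. The evaluate() function, the mutate() rule, and the accept-if-better rule are all my own assumptions for illustration.

```python
import random

grid = {
    "lrn_rate": [0.001, 0.005, 0.010, 0.015, 0.020, 0.025],
    "bat_size": [4, 8, 10, 16, 32, 64],
    "max_epochs": [100, 500, 1000, 2000, 3000, 4000,
                   5000, 6000, 7000, 8000, 9000],
}

def evaluate(config):
    # Hypothetical placeholder: synthetic error, not a real training run.
    return ((config["lrn_rate"] - 0.010) ** 2) * 1.0e4 \
        + abs(config["bat_size"] - 16) / 64 \
        + 100.0 / config["max_epochs"]

def mutate(config, rng):
    # Move one randomly chosen hyperparameter one step up or down
    # in its candidate list (clamped at the ends of the list).
    name = rng.choice(list(grid))
    values = grid[name]
    idx = values.index(config[name])
    new_idx = min(max(idx + rng.choice([-1, 1]), 0), len(values) - 1)
    child = dict(config)
    child[name] = values[new_idx]
    return child

def mutation_search(n_iters=300, seed=0):
    rng = random.Random(seed)
    best = {name: rng.choice(values) for name, values in grid.items()}
    best_err = evaluate(best)
    for _ in range(n_iters):
        child = mutate(best, rng)   # mutation only, no crossover
        err = evaluate(child)
        if err < best_err:          # keep the child only if it is better
            best, best_err = child, err
    return best, best_err

best, best_err = mutation_search()
print(best, best_err)
```

Each iteration costs one evaluation, so 300 iterations is fewer evaluations than the 396 combinations in the full grid. In a real setting each evaluation would itself be averaged over several seed values.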

The DGD algorithm is explained in an appendix to the 2020 research paper “Working Memory Graphs” by R. Loynd et al. from the Proceedings of the 37th International Conference on Machine Learning.

Interesting stuff.

I got my love of combinatorial mathematics from playing poker when I was at Servite High School in Anaheim, California, when I should have been doing my Latin homework. We usually played at the house of my classmate Michael Ventriglia.

Three random images from an Internet search for “poker face”.


1 Response to Hyperparameter Tuning Using Distributed Grid Descent

  1. Thorsten Kleppe says:

    Hi Dr. McCaffrey,

    You are still making sure that this is the best place to dive deep into ML. Really interesting, I hope you will create a demo.

    The thing with the seeds is really difficult. I wish we could just leave out the weight seed, but that doesn’t seem possible yet. Anyway, the influence of the seed is very high, which can easily lead to a wrong track. Moreover, two different sized networks usually have different positions for the same seed, which again mixes everything up and practically acts like a new, more complex seed.

    On the other hand, we can find more robust networks that give good results even under unfavorable conditions.

    Thanks for the cool Saturday post. Seems like a hot topic for me to learn.
