I was chit-chatting with one of my work colleagues about an algorithm he created for hyperparameter tuning. The algorithm is called Distributed Grid Descent (DGD).
Every neural prediction system has hyperparameters such as the training learning rate, batch size, number of hidden nodes in the architecture, hidden layer activation function, and so on. Complex systems can have 10-20 hyperparameters.
In regular grid search, you set up candidate values, for example, lrn_rate = [0.001, 0.005, 0.010, 0.015, 0.020, 0.025] and bat_size = [4, 8, 10, 16, 32, 64] and max_epochs = [100, 500, 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000], and then try all possible combinations. The program's random number generator seed value is sort of a hyperparameter too: you should try each set of hyperparameter values with typically about 10 different seed values and then average the resulting error values (often mean squared error).
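To make that concrete, here is a minimal Python sketch of exhaustive grid search with seed averaging, using the candidate lists above. The train_and_eval() function is a hypothetical placeholder for whatever training and evaluation code the real system uses; here it just computes a dummy error value so the sketch runs.

# Minimal sketch of exhaustive grid search with seed averaging.
# train_and_eval() is a hypothetical placeholder, not real training code.

import itertools
import numpy as np

lrn_rates = [0.001, 0.005, 0.010, 0.015, 0.020, 0.025]
bat_sizes = [4, 8, 10, 16, 32, 64]
max_epochs = [100, 500, 1000, 2000, 3000, 4000, 5000,
              6000, 7000, 8000, 9000]
seeds = range(10)  # average each combination over 10 seed values

def train_and_eval(lr, bs, me, seed):
  # placeholder: train the network, return mean squared error;
  # here just a dummy function of the inputs so the sketch runs
  rng = np.random.default_rng(seed)
  return abs(lr - 0.010) + 1.0 / bs + 1.0 / me + 0.01 * rng.random()

best_err = float("inf"); best_combo = None
for (lr, bs, me) in itertools.product(lrn_rates, bat_sizes, max_epochs):
  avg_err = np.mean([train_and_eval(lr, bs, me, s) for s in seeds])
  if avg_err < best_err:
    best_err = avg_err; best_combo = (lr, bs, me)

print("best (lrn_rate, bat_size, max_epochs) = ", best_combo)
print("average MSE = %0.4f" % best_err)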
Hyperparameter tuning with candidate values is a type of combinatorial optimization problem.
It's not always possible to try all possible combinations of grid hyperparameter values, because some systems have too many combinations and some systems have very long training times. For example, the three candidate lists above give 6 * 6 * 11 = 396 combinations, and with 10 seed values each, that's 3,960 training runs.
My colleague's DGD algorithm searches through the candidate hyperparameter values without trying every possible combination. The DGD algorithm is a form of evolutionary optimization that uses only mutation to generate a new candidate solution (a new set of hyperparameter values). Some evolutionary optimization algorithms use crossover in addition to mutation.
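The full DGD algorithm is distributed (it runs across multiple workers) and is spelled out in the paper appendix mentioned below. To be clear, the code below is not the actual DGD algorithm, just a bare-bones, single-process illustration of mutation-only search over grid candidate values: start from one candidate set of hyperparameter values, repeatedly mutate one randomly chosen hyperparameter to a neighboring grid value, and keep the mutated set if its averaged error is lower. The train_and_eval() function is the same hypothetical placeholder as before.

# Bare-bones sketch of mutation-only search over grid candidate values.
# This is NOT the actual DGD algorithm, just an illustration of the idea.
# train_and_eval() is the same hypothetical placeholder as before.

import numpy as np

grid = {
  "lrn_rate": [0.001, 0.005, 0.010, 0.015, 0.020, 0.025],
  "bat_size": [4, 8, 10, 16, 32, 64],
  "max_epochs": [100, 500, 1000, 2000, 3000, 4000, 5000,
                 6000, 7000, 8000, 9000]
}

def train_and_eval(combo, seed):
  # placeholder: train the network, return mean squared error;
  # here just a dummy function of the inputs so the sketch runs
  rng = np.random.default_rng(seed)
  return (abs(combo["lrn_rate"] - 0.010) + 1.0 / combo["bat_size"] +
          1.0 / combo["max_epochs"] + 0.01 * rng.random())

def avg_error(combo, n_seeds=10):
  # average the error metric over several seed values
  return np.mean([train_and_eval(combo, s) for s in range(n_seeds)])

def mutate(combo, rng):
  # change one randomly chosen hyperparameter to a neighboring grid value
  child = dict(combo)
  key = list(grid)[rng.integers(len(grid))]
  idx = grid[key].index(combo[key])
  idx = min(max(idx + rng.choice([-1, 1]), 0), len(grid[key]) - 1)
  child[key] = grid[key][idx]
  return child

rng = np.random.default_rng(0)
curr = { k : v[len(v) // 2] for (k, v) in grid.items() }  # start mid-grid
curr_err = avg_error(curr)
for i in range(50):  # 50 mutation steps
  cand = mutate(curr, rng)
  cand_err = avg_error(cand)
  if cand_err < curr_err:  # keep the mutation only if it helps
    curr = cand; curr_err = cand_err

print("best found: ", curr)
print("average MSE = %0.4f" % curr_err)

The greedy accept-if-better rule here is just to keep the sketch short; the actual DGD sampling and acceptance rules, and how results are shared among workers, are in the paper appendix.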
The DGD algorithm is explained in an appendix to the 2020 research paper “Working Memory Graphs” by R. Loynd et al. from the Proceedings of the 37th International Conference on Machine Learning.
I got my love of combinatorial mathematics from playing poker when I was at Servite High School in Anaheim, California, when I should have been doing my Latin homework. We usually played at the house of my classmate Michael Ventriglia. Three random images from an Internet search for “poker face”.