Briefly: The term warm-start training is used with standard neural networks, and the term fine-tuning training is used with Transformer architecture networks. Both are essentially the same technique, yet warm-start training is ineffective while fine-tuning training is effective. The reason for this apparent contradiction isn’t completely clear, and it’s related to a new idea being called “the physics of AI”. However, there’s really no contradiction, because Transformer architecture networks work quite a bit differently from standard neural networks.
Bear with me for a minute. In warm-start training, you have a standard neural network model such as a CNN image classifier or a regression prediction system. As new data arrives (for example, new house sales data for a house price prediction system), instead of retraining your network from scratch using all your data (the old data plus the new data) with randomly initialized weights, you retrain your network using just the new data, starting with the old weights. This is called warm-start training. As it turns out, surprisingly, warm-start training doesn’t work very well, in the sense that the new model doesn’t generalize well to new, previously unseen data.
An example of warm-start training
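To make the idea concrete, here is a minimal warm-start sketch in PyTorch. The Net class, the random stand-in data, and the train_model() helper are all hypothetical placeholders; the point is only that Option 2 starts from the previously trained weights instead of from a random initialization.

```python
import torch
import torch.nn as nn

# Hypothetical regression network (e.g., house price prediction).
class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))

    def forward(self, x):
        return self.layers(x)

def train_model(model, data_x, data_y, epochs=100):
    # A bare-bones training loop -- real code would use batching, etc.
    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss_fn(model(data_x), data_y).backward()
        opt.step()
    return model

# Stand-in data: 100 old items, then 20 new items arrive later.
old_x, old_y = torch.randn(100, 8), torch.randn(100, 1)
new_x, new_y = torch.randn(20, 8), torch.randn(20, 1)

# Option 1: retrain from scratch -- fresh random weights, all the data.
scratch_model = train_model(Net(), torch.cat([old_x, new_x]),
                            torch.cat([old_y, new_y]))

# Option 2: warm-start -- keep the previously trained weights and
# continue training on just the new data.
warm_model = train_model(Net(), old_x, old_y)       # original training
warm_model = train_model(warm_model, new_x, new_y)  # warm-start update
```

The surprising research result is that warm_model tends to generalize worse than scratch_model, even when both reach similar accuracy on the training data.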
This phenomenon — the relative ineffectiveness of warm-start training — has been explored in the research paper “On Warm-Starting Neural Network Training” by J. Ash and R. Adams.
OK. Now in fine-tuning training, you have a Transformer architecture model such as a GPT-3 large language model. The pre-trained model has learned the English language in ways that researchers don’t fully understand. To adapt the large model to a specific problem domain, such as an AI chemistry assistant, you continue training the model, starting with the existing GPT-3 weights (all 175 billion of them), using the new chemistry data. The resulting model seems to work very well (although as I write this blog post, this is all a very new area of exploration).
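Here’s a minimal, hedged sketch of what fine-tuning looks like in code, using the Hugging Face transformers library with GPT-2 standing in for GPT-3 (whose weights aren’t publicly available) and a made-up two-sentence chemistry dataset:

```python
# pip install torch transformers
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")  # existing pre-trained weights

# Hypothetical domain-specific chemistry text.
chemistry_texts = [
    "Benzene is an aromatic hydrocarbon with the formula C6H6.",
    "A catalyst lowers the activation energy of a chemical reaction.",
]

# Continue gradient descent from the pre-trained weights using the
# new data -- structurally the same idea as warm-start training.
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
model.train()
for epoch in range(3):
    for text in chemistry_texts:
        batch = tokenizer(text, return_tensors="pt")
        loss = model(**batch, labels=batch["input_ids"]).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```

The key point is that from_pretrained() loads the existing weights, so training continues from there rather than from a random initialization.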
Note: Augmenting a large language model in this way is normally accomplished using relatively little data, a technique called one-shot training or few-shot training.
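As a hedged illustration (the prompt text below is made up), few-shot adaptation of a GPT-3 style model can be as simple as placing a handful of worked examples in the prompt itself, with no weight updates at all:

```python
# A few-shot prompt for a GPT-3 style model: a handful of worked
# examples are shown at inference time; no weights are updated.
# The examples below are hypothetical.
few_shot_prompt = """Q: What is the chemical formula for water?
A: H2O

Q: What is the chemical formula for methane?
A: CH4

Q: What is the chemical formula for ammonia?
A:"""
# The prompt would be sent to the model's text-completion endpoint,
# which should reply with something like "NH3".
```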
Note: The terms “warm-start training” and “fine-tuning training” are not rigidly defined in the research literature, so they can have different meanings in different research papers.
So, if you think about standard neural network warm-start training and Transformer architecture fine-tuning training, they’re the same technique: you train a new model using new data, but you start with the existing model weights. Yet warm-start training appears to be ineffective while fine-tuning training appears to be effective. It’s likely that pre-trained Transformer architecture networks learn general-purpose connections that allow a fine-tuned network to generalize better.
This comparison between standard networks and Transformer networks highlights the fact that deep neural models are not well understood.
One of my work colleagues, Sebastien Bubeck, has suggested an approach to the science of deep learning that roughly follows what physicists do to understand reality:
1.) Explore phenomena through controlled experiments.
2.) Build theories based on simple mathematical models that aren’t necessarily fully rigorous.
Fascinating stuff. By the way, I became aware of these ideas via ad hoc, impromptu hallway conversations at the large tech company I work for. These conversations would not have happened if I were working remotely from home. There’s overwhelming evidence that, for the type of work I do, working in a traditional office/lab environment increases productivity and creativity, and (for me at least) it increases my job satisfaction.
If you Google for “physics of AI” you can find a YouTube presentation. Sebastien looks somewhat menacing here but in real life he’s friendly.