The recent surge of interest in Large Language Models (LLMs) like GPT-4 has motivated me to set aside my mathematical research for a time and focus instead on A.I. The current LLM revolution does not seem to be a passing fad.
I’m predicting not one, but three A.I. revolutions that I expect to follow in fairly short order:
- The Large Language Model Revolution
- The Machine Vision Revolution
- The Personal Robotics Revolution
The current LLM revolution, triggered by OpenAI’s GPT-3 and GPT-4 models, is based on the Transformer architecture introduced in the 2017 paper “Attention Is All You Need”.
The attention mechanism allows the neural network to select a small subset of its state for processing. One model I’ve studied can use attention to pick six values out of 8,192. Much as a human being could not simultaneously process all of the words on a printed page, attention allows a neural network to focus on a small subset of its state. This technique considerably reduces how much the network has to process at once (six values instead of 8,192), which is a big win since the models are already huge (billions or even trillions of parameters).
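To make the idea concrete, here is a minimal sketch of the scaled dot-product attention described in the Transformer paper, written in plain Python. The toy keys, values, and query are made up for illustration, and the `attend` helper is my own name, not anything from a real library. Note that in this standard form the "selection" is soft: every item gets a weight, but most weights end up close to zero.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector.

    Returns the attention weights and the weighted average of the values.
    """
    d = len(query)
    # Similarity of the query to each key, scaled by sqrt(dimension).
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values))
              for i in range(dim)]
    return weights, output

# Hypothetical toy data: four items, each with a 2-d key and a 2-d value.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0], [-10.0, 0.0]]
query = [4.0, 0.0]

weights, output = attend(query, keys, values)
for w, key in zip(weights, keys):
    print(f"key {key} gets weight {w:.4f}")
print("output:", output)
```

With this query, nearly all of the weight lands on the first and third items, whose keys point in the same direction as the query; the other two contribute almost nothing to the output. A real model does the same thing with learned queries, keys, and values over thousands of positions.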
This approach has been stunningly successful in building interactive language models that respond to instructions much like a human. Of course, the current LLMs make all kinds of mistakes and require significant computational resources to operate. Training them requires massive computational resources that are currently beyond the reach of most software developers.
I suspect that the same transformer architecture can be used effectively for machine vision, since the human brain probably uses the same basic neural network architecture for language, speech, and visual recognition. Visual recognition seems like another task that requires focusing attention on a small subset of a large and complicated collection of data. Human visual recognition seems to be based on a complex model of our surrounding world that we develop over years. Optical illusions demonstrate that our vision system is based not on processing massive amounts of data, but rather on matching that data to our pre-trained visual model.
I expect that we’ll soon see impressive advances in machine vision, with large-parameter vision models being developed that have a human-like capability to recognize the contents of images. Small- and medium-sized machine vision models should be useful for more specialized tasks, such as observing an assembly line or a kitchen. This will constitute the Machine Vision Revolution, and it is likely already under way.
Once the transformer architecture has been effectively applied to build machine vision models, the major remaining challenge will be to optimize them so that they can recognize video in real time. This is the primary current obstacle to robotics. Actually building a robot arm, for example, is fairly easy, but figuring out how to direct it is much harder. Consider how difficult everyday tasks are for the blind. Once comprehensive machine vision can be done in real time, the Personal Robotics Revolution should soon follow.
Let me end on a note of warning for the free software community. OpenAI is well ahead of us in Large Language Models, and Boston Dynamics is well ahead of us in robotics. Advanced neural network architectures appear to be significantly different from compilers and operating systems in that their massive compute requirements put them out of reach of the average small developer. To truly master this technology, we must train our own models, and probably lots of them, in order to understand how they work. A distributed machine learning system, based on something like BOINC, might allow us to combine the computing power of many smaller computers to train these models without spending millions of dollars.
To those open source advocates who seem to be content with pre-trained models like Falcon and LLaMA, I would like to ask a question:
Would a leaked version of the Intel C++ Compiler be a suitable replacement for gcc?