StarCraft II players don’t train for 200 years – the limits of current AI

Having conquered games like chess and Go, AI researchers are now tackling real-time strategy games like StarCraft and Dota. AlphaStar, a product of Google’s DeepMind research labs, has beaten some of the world’s best StarCraft II players, and Microsoft recently invested $1 billion in OpenAI, a company that has worked on the game Dota 2.

AlphaStar playing MaNa at StarCraft II

The image above shows AlphaStar playing the pro StarCraft player MaNa in a Protoss versus Protoss game. It is truly impressive to see how AlphaStar is able to generate entirely new strategies to outwit its opponent.

DeepMind were careful to hobble the AI to create a fair comparison. An important concept in StarCraft II (and other real-time strategy games) is Actions Per Minute (APM): how many commands a player issues in a minute. Some pro players are remarkable in how quickly they can play. The world champion, Serral, often spikes above 700 actions per minute and averages over 300. With enough computational power, though, machines could easily exceed these levels. Such a contest would not be entirely fair. It would be like pitting a human against a machine at numerical computation – impressive, but not very surprising. For this reason, DeepMind restricted the number of actions that AlphaStar could perform to human-like numbers. Its APM still spikes quite high at times, but not to completely ridiculous levels.

In other ways, though, AlphaStar retains advantages beyond what its human opponents could ever hope to achieve. It is, for instance, able to pay attention to the whole game all the time, whereas human players must divide their attention. That is an issue that could be addressed, and in fact the underlying architecture already uses selective attention mechanisms to improve its performance.

However, there is another issue that is more fundamental. It took AlphaStar the equivalent of about 200 years of real-time play to reach this level of proficiency. Keep in mind that it was trained on just one map and only on the Protoss versus Protoss matchup. Human players clearly do not have this much time to learn, and yet they generalize across a much wider range of scenarios.

Clearly people are doing something different from what AlphaStar is doing, and I would suggest that the difference goes to the heart of how AlphaStar learns. There are a great many components to the system, and the article provides pointers to papers about these components. However, AlphaStar and most of the successful AI enterprises of recent years rely on error-driven learning. In error-driven learning algorithms, a representation of the current situation is presented to the system and it produces its current response. Incremental changes are then made to the parameters based on a teacher signal, which may be the choices made by human players in the case of supervised learning, or a general goodness signal in the case of reinforcement learning. Typically, the changes made on any given iteration are small, because it is difficult to assign blame for the outcome to any specific unit within the system. It is only over time that patterns of blame are identified, which the system then remedies.
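To make the idea of error-driven learning concrete, here is a minimal sketch using the delta rule on a single linear unit. It is far simpler than anything inside AlphaStar and is not taken from DeepMind's system; the function name, the toy examples and the learning rate are all assumptions chosen for illustration. The point it tries to show is that each update nudges the weights only slightly, so many passes over the data are needed before the responses match the teacher signal.

```python
# Illustrative sketch only: a single linear unit trained with the delta rule.
# The names, data and learning rate are hypothetical, not AlphaStar's method.

def train_delta_rule(examples, learning_rate=0.05, epochs=200):
    """Return weights learned by repeated small error-driven updates."""
    n_inputs = len(examples[0][0])
    weights = [0.0] * n_inputs
    for _ in range(epochs):
        for inputs, target in examples:
            # The system produces its current response to the situation.
            output = sum(w * x for w, x in zip(weights, inputs))
            # The teacher signal tells it how wrong that response was.
            error = target - output
            # Each weight takes a small share of the blame for the error.
            weights = [w + learning_rate * error * x
                       for w, x in zip(weights, inputs)]
    return weights


# Toy situations paired with the response the teacher wants.
examples = [([1.0, 0.0], 1.0), ([0.0, 1.0], -1.0), ([1.0, 1.0], 0.0)]

one_pass = train_delta_rule(examples, epochs=1)   # weights have barely moved
weights = train_delta_rule(examples, epochs=200)  # close to the teacher's mapping

print("after 1 pass:    ", one_pass)
print("after 200 passes:", weights)
```

A single pass leaves the weights far from a solution; it takes many repetitions of the same experiences before the unit settles on the mapping the teacher intends.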

People are able to learn much more rapidly – in the memory literature we talk of one-shot learning. Critically, it may not be possible to determine the significance of a particular experience until the time of retrieval. Rather than distilling the function that must be computed, as many machine learning models do, memory models store traces and decide on the relevance of those traces when the retrieval cue is presented.

The classic example of such a model is Minerva II, which was developed by Doug Hintzman in the 1980s. In Minerva II, traces represented as vectors are stored in memory, and outputs are computed by comparing the current input to each of these traces and then summing the results. The model has the advantage that it can be used to elicit specific memories if one has access to specific cues, or to generalize across many memories if the cues are less specific. The ability to switch flexibly between episodic and schema-extraction dynamics is missing in current machine learning models, and I think it is going to be important for realizing human-like learning and generalization capabilities without requiring hundreds of effective years of training.
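The retrieval step described above can be sketched in a few lines. This is a simplified illustration rather than a faithful reimplementation: it assumes features coded as +1, -1 or 0 (unknown), a similarity normalized by the number of relevant features, and a cubed similarity as each trace's activation, which is my recollection of the usual formulation. The function names and the toy traces are made up for the example.

```python
# Simplified sketch of Minerva II-style retrieval. Function names and toy
# traces are hypothetical; features are +1, -1, or 0 for "unknown".

def similarity(probe, trace):
    """Dot product scaled by the number of features nonzero in either vector."""
    relevant = sum(1 for p, t in zip(probe, trace) if p != 0 or t != 0)
    if relevant == 0:
        return 0.0
    return sum(p * t for p, t in zip(probe, trace)) / relevant


def echo(probe, traces):
    """Compare the probe to every stored trace, then sum the results."""
    # Cubing keeps the sign but lets close matches dominate the sum.
    activations = [similarity(probe, trace) ** 3 for trace in traces]
    intensity = sum(activations)
    content = [sum(a * trace[j] for a, trace in zip(activations, traces))
               for j in range(len(probe))]
    return intensity, content


# Each experience is stored as its own trace after a single exposure.
traces = [
    [1, 1, -1, -1, 1, -1],
    [1, 1, -1, -1, -1, 1],
    [-1, -1, 1, 1, 1, 1],
]

# A specific cue matches the first trace best, so the echo fills in
# the unknown last feature with that trace's value (negative).
specific_intensity, specific_content = echo([1, 1, -1, -1, 1, 0], traces)

# A vague cue activates the first two traces equally: the features they
# share come back strongly, the features on which they differ only weakly.
general_intensity, general_content = echo([1, 1, 0, 0, 0, 0], traces)

print("specific cue echo:", specific_content)
print("general cue echo: ", general_content)
```

Nothing is distilled at storage time here. The same stored traces yield either a specific memory or a generalization, depending only on how specific the cue is at retrieval.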
