Last week, a couple of friends and I had an idea: let’s use PySC2 (a Python wrapper for the StarCraft 2 API) to build a reinforcement learning agent that can teach itself how to play StarCraft 2.
This is no easy task, and many people have attempted it.
But we thought it would be an interesting way to get into the world of Reinforcement Learning.
Our agent is an A2C network which, at the moment, converges to outputting only 0s when trained.
Below is our first version of the AI, taking random actions. The map is called CollectMineralShards; the aim is to move your Marines (the two green circles) over as many Mineral Shards (the blue circles) as possible in the allotted time.
Getting it to run
If you want to try out our project yourself, head over to https://github.com/deepmind/pysc2 and follow their instructions to set up the StarCraft 2 environment.
For our algorithm we used PyTorch, so make sure you have that installed and running.
After you install PyTorch, head over to GitHub, clone our repository, and run it according to the instructions: https://github.com/Tzeny/deepstellar
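In short, that usually means pip install pysc2, the PyTorch install command for your platform, and then a git clone of https://github.com/Tzeny/deepstellar – but treat the linked instructions as the source of truth.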
PySC2 – StarCraft II Learning Environment
PySC2 is DeepMind’s Python component of the StarCraft II Learning Environment (SC2LE). It exposes Blizzard Entertainment’s StarCraft II Machine Learning API as a Python RL Environment.
It can run one or two agents per game, and many games in parallel.
Agents get an observation of the game state every N in-game time steps. N is configurable; in our case we used N = 16, which works out to an APM of roughly 90.
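For context, here is a rough sketch of how a CollectMineralShards environment with N = 16 might be created. The argument names assume PySC2 2.0’s interface (step_mul is the N above), so treat this as an illustration rather than our exact setup.

# Illustrative environment setup, assuming PySC2 2.0's interface.
from pysc2.env import sc2_env
from pysc2.lib import features

env = sc2_env.SC2Env(
    map_name="CollectMineralShards",
    players=[sc2_env.Agent(sc2_env.Race.terran)],
    agent_interface_format=features.AgentInterfaceFormat(
        feature_dimensions=features.Dimensions(screen=84, minimap=64)),
    step_mul=16,  # take one action every 16 game steps, roughly 90 APM
    visualize=True)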
Below is the code for an agent that takes a random action at each time step.
"""A random agent for starcraft."""
def step(self, obs):
super(RandomAgent, self).step(obs)
function_id = numpy.random.choice(obs.observation.available_actions)
args = [[numpy.random.randint(, size) for size in arg.sizes]
for arg in self.action_spec.functions[function_id].args]
return actions.FunctionCall(function_id, args)
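If you only want to see this random behaviour without writing any code, PySC2 ships a runner for it – at the time of writing, something like python -m pysc2.bin.agent --map CollectMineralShards launches its bundled random agent on this map.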
The obs object contains valuable information about the game state:
- Screen (you can select any combination of the 2 items below)
  - Features (shown in Figure 1 above)
  - RGB pixels
- Minimap (you can select any combination of the 2 items below)
  - Features (shown in Figure 1 above)
  - RGB pixels
- Player information
- Control groups
- Single select
- Multi select
- Cargo
- Build queue
- Alerts
- Available actions
- Last actions (only for successful actions)
- Action result
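To make the list above more concrete, here is a small sketch of reading a few of these fields out of obs; the field names assume PySC2 2.0’s named observations, so check them against your version.

import numpy as np

def summarize(obs):
    # Screen and minimap feature layers come out as (channels, height, width) arrays.
    screen = np.asarray(obs.observation.feature_screen)
    minimap = np.asarray(obs.observation.feature_minimap)
    # Ids of the actions that are legal right now (used to mask the actor's choices).
    available = obs.observation.available_actions
    print("screen layers:", screen.shape)
    print("minimap layers:", minimap.shape)
    print("available actions:", available)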
Reinforcement Learning – A2C
Our agent has to look at the observations for the current step, choose an action that would best further its goals, and predict a value for the current state.
A state’s value is the expected sum of all the rewards you would collect if you started in that state and kept playing from there.
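Written out: V(s) = E[r₁ + γ·r₂ + γ²·r₃ + … | starting from state s], where γ is the usual discount factor (which the plain-English definition above leaves out).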
Intuitive explanation of A2C: https://hackernoon.com/intuitive-rl-intro-to-advantage-actor-critic-a2c-4ff545978752
All A2C architectures have 2 heads:
- Actor – responsible for choosing an action; in our case the actor outputs both an action_id distribution and 4 continuous values used as arguments for some of the actions (for example, the 331/Move_screen command requires a point (x, y) on the screen as an argument)
- Critic – responsible for judging how good the actor’s actions are (a rough sketch of such a two-headed network follows below)
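This is not our exact network, just a minimal PyTorch sketch of what such a two-headed model over the screen feature layers could look like; the layer sizes, the sigmoid on the argument head, and the class and parameter names are illustrative assumptions.

import torch
import torch.nn as nn

class A2CSketch(nn.Module):
    """Illustrative two-headed A2C model over screen feature layers (not our exact architecture)."""
    def __init__(self, screen_channels, num_actions, screen_size=84):
        super().__init__()
        # Shared convolutional trunk.
        self.trunk = nn.Sequential(
            nn.Conv2d(screen_channels, 16, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(32 * screen_size * screen_size, 256), nn.ReLU())
        # Actor head: a distribution over action ids plus 4 continuous argument values.
        self.action_logits = nn.Linear(256, num_actions)
        self.action_args = nn.Linear(256, 4)
        # Critic head: a single scalar estimate of the state's value.
        self.value = nn.Linear(256, 1)

    def forward(self, screen):
        h = self.trunk(screen)
        logits = self.action_logits(h)             # unnormalised action_id distribution
        args = torch.sigmoid(self.action_args(h))  # 4 values in [0, 1], e.g. scaled to screen coordinates
        value = self.value(h)                      # critic output
        return logits, args, value

The actor samples an action_id from logits (masked by available_actions) and fills in its arguments from args, while the critic’s value feeds the advantage estimate used during training.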
There are some inputs that have variable sizes, so we are not feeding them to the network just yet.
The idea is to run the model for a number of time steps (10 in our case), and then train both the actor and critic.
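As an illustration of that loop (not our actual training code – the update step is deliberately left out, and to_tensor / sample_action stand in for preprocessing and sampling helpers you would write yourself):

def collect_rollout(env, model, last_timestep, n_steps=10):
    """Collect an n-step rollout (training from it is not shown here)."""
    observations, actions, rewards, values = [], [], [], []
    timestep = last_timestep
    for _ in range(n_steps):
        logits, args, value = model(to_tensor(timestep))  # to_tensor: hypothetical preprocessing helper
        action = sample_action(logits, args, timestep)    # sample_action: hypothetical sampling helper
        timestep = env.step([action])[0]                  # PySC2 returns one timestep per agent
        observations.append(timestep)
        actions.append(action)
        rewards.append(timestep.reward)
        values.append(value)
    return observations, actions, rewards, values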
I will explain the loss function in the next post, as I don’t have a good understanding of it yet.