https://github.com/geneura-papers/2017-StarCraft-Data
Raw File
Tip revision: 9ccbf551898127f82deba0cd6db4ab3ba3da9e3b authored by JJ Merelo on 29 December 2017, 11:09:48 UTC
Done here
Tip revision: 9ccbf55
starcraft-data.Rmd
---
title: 'RedDwarfData: a simplified dataset of StarCraft matches'
author: |
  | Juan J. Merelo-Guervós, Antonio Fernández-Ares
  | University of Granada, Spain
  |
  | Antonio Álvarez Caballero
  | University of Oviedo, Spain
  |
  | Pablo García-Sánchez
  | University of Cádiz, Spain
  |
  | Victor Rivas
  | University of Jaen, Spain
date: "December 29th, 2017"
output:
  pdf_document: 
    keep_tex: true
bibliography: prediction.bib
abstract: The game Starcraft is one of the most interesting arenas to test new machine
  learning and computational intelligence techniques; however, StarCraft matches take
  a long time and creating a good dataset for training can be hard. Besides, analyzing
  match logs to extract the main characteristics can also be done in many different
  ways to the point that extracting and processing data itself can take an inordinate
  amount of time and of course, depending on what you choose, can bias learning algorithms.
  In this paper we present a simplified dataset extracted from the set of matches
  published by Robinson and Watson, which we have called RedDwarfData, containing
  several thousand matches processed to frames, so that temporal studies can also
  be undertaken. This dataset is available from GitHub under a free license. An initial
  analysis and appraisal of these matches is also made.
---

```{r setup, include=FALSE}
#Librerías
knitr::opts_chunk$set(cache=TRUE)
source("scripts/funciones.R")
library(ggplot2)
library(ggthemes)
load("scripts/datos/duration.Rdata")

#Tamaño del conjunto de datos
unique.ids <- length(duration$id)

```

# Introduction

\emph{StarCraft} is a well-known Real Time Strategy (\emph{RTS}) game whose first version was released in 1998 [@yoon2001starcraft]. The game consists of two opponents trying to beat each other over a square-shaped arena. Every player has to spawn workers and create buildings using the available resources, which are located over the terrain. Players can use one of three different kinds of species, called Zergs, Protoss or Terrans, with different kinds of individuals, which differ in their skills, so that players have a wide range of possibilities in order to plan their strategies. The need to dynamically take decisions for both the short and the long terms make RTS adequate scenarios to test AI algorithms [@ontanon2013survey] that could be used later in real problems including financial, military and logistics scenarios. 

\emph{StarCraft} has become popular among researchers thanks in part to \emph{BWAPI}, an API developed by the community that allows programming artificial agents and that was developed for StarCraft competitions [@buro2012real]. These agents can act just like regular players, according to a previously established algorithm; but, more importantly, they can collect data as matches are running. The data collected this way can be later analysed, so that machine learning techniques can be applied for both supervised and unsupervised learning.

Developing and releasing datasets for developing and applying these techniques is thus a way of expanding research in this area and allowing researchers to develop their own bots or prediction algorithms without having to scrape them from websites or actually play the games, a time-consuming task, to extract data from them. Several researchers have done so and we will refer to them in the state of the art; in this paper we start from Robertson and Watson's dataset [@RobertsonW14] and create a simpler one that can be used without needing many resources; even so, the data set is balanced with respect to duration, races and number of resources used by the players. 

The rest of the paper is organized as follows: next we will present the Starcraft datasets that have been published so far. In the next section we will show how Robertson and Watson's dataset was processed and the rationale for doing it that way. Then we will present a macro picture of the data set, to eventually draw some conclusions from it and possible lines of work using this dataset.

## State of the art

The main motivation for writing this paper was the proliferation of new StarCraft datasets in the last few years. Players have been uploading their matches command logs to several websites, such as TeamLiquid \footnote{\url{http://www.teamliquid.net}} or BWReplays \footnote{\url{http://www.bwreplays.com}} for over 10 years. These logs are usually employed by the player community to learn new playing techniques from other players. From the point of view of using them for AI training, however, these datasets are in raw form and require some kind of command execution, parsing, data selection and filtering in order to serve as a good base for learning models [@Stardata17]. Moreover, a lot of uploaded replays may be corrupt or incomplete. Different authors have been processing these replays to create datasets that can be used to predict the outcome of a match or a battle or simply gather insights on gaming strategies. The first paper that describes the processing of replays was the one presented by Weber and Mateas [@Weber09], where they processed replays to create a dataset to compare several classification algorithms. 5393 replays were used and they store specific events during the games, not the full game state, validating the possibility of predicting the winner of a game to a certain degree. Later, other datasets that store the full game state to apply generic machine learning techniques have appeared [@Synnaeve12].

Instead providing the game logs directly, Robertson and Watson [@RobertsonW14] distributed the logs in a database, created by obtaining the replays from several replays webpages, executing the game and storing a large number of game features data every certain number of frames. This dataset has been used by other researchers, for example, to create models differentiating by race [@Ravari16].

Recently, and built upon the Synnaeve and Bessiere work [@Synnaeve12], Lin et al. [@Stardata17] have released the largest dataset to date. It includes 65646 replays, storing the full state of the game every 3 frames, including up to 30 features by unit and battle detections, storing all the data in TorchCraft format to be accessed by specific software clients. Even using compression mechanisms, this dataset requires 365GB of space. The aim of using this dataset is to apply Deep Learning algorithms that require a large amount of data.

However, there is not a small and limited dataset ready to use for teaching purposes, for example, to directly use as input for certain algorithms without any modification. This is the reason we have used the Robertson and Watson database [@RobertsonW14] to extract a limited number of features and parsed it directly to CSV (Comma Separated Values) format. These CSV files can be easily downloaded and straightforwardly parsed with any programming language. <!--PABLO: Any other ideas to justify our dataset?-->

## Data extraction and filtering

One of the advantages of the presented dataset is that they are already pre-processed and ready to use. The origin of the samples are the replays obtained by [@RobertsonW14]. As previously stated, they provided the data in the form of a huge relational database, so an intermediate step to produce an usable and manageable CSV file was needed. <!--A query which extract data from one single replay has been developed; it is explained on the lines below. A simple function which iterates over all the replays has been developed too, but the most difficult task is parsing individual replays. PABLO: these phrases do not add useful information--> 

For every available match in the database, identified by it \emph{ReplayID}, a different number of samples is extracted. The first step is to take the \emph{Duration} and the identifiers \emph{PlayerReplayID} of the two players involved in the game. The \emph{PlayerReplayID} with the minimum value is taken as the first player; this does not actually introduce a bias. <!-- Pablo: why? Is always the player 1? Because usually people upload the games if they win
Antonio: The dataset is nearly-balanced, so I think there is no problem here :)
-->

Because of the dynamic nature of the game, it is important to use a time reference to be able to track the match in every moment. Instead sampling the matches every certain number of frames as other authors propose, the chosen reference is the \emph{Frame} when a change in resources (\emph{Minerals}, \emph{Gas} and \emph{Supply}) occurs. This instant does not have to be the same for the two players, so all the instants when a change is discovered are saved. This choice implies a lot of empty values in the database; however, this is easily fixed taking the last known value to fill each empty one: if there is not a change for a player, the resources would be the last seen ones.

This decision is also important to track the units, that are valued by their price in resources. However, the value recorded in the original database is split by regions, which are pieces of the map. To register the value of some units, all the region's values have to be summed up, but the frame when these changes occur are not the same that the  resources' ones, so the last change for each region is saved and accumulated. <!--Pablo: I do not understand the last phrase
Antonio: Let 2000 be the current frame, because a gas change is performed. Maybe the changes in a region occurs in 1400 and 2300, so as 1400 is lower than 2000 (the reference), that is the frame selected to sum up.
--> The same process is also performed for the enemy units. However, this value is an estimation, not a real one, because due to the rules, a player do not know nothing about the enemy unless they see the other, and at that moment, these real values are stored in the logs. <!-- Antonio: I think this is important, it is an estimated value! But "they see the other" sounds strange... Pablo: yeah, sounds good, I added that they are also stored in the log at that moment. Could you explain the "region" concept? . Antonio: Done.-->

<!--These units values are also available at the enemy side: the estimation of the enemy values is recorded in the dataset, so the same process is also done for these records. Pablo: I don't know when to use the 's with some names, but I think it is only used when the name "possesses" the thing :P However, I summarized this paragraph above -->

With all the data well-formed, only the \emph{ReplayID}, \emph{Winner} and \emph{Duration} have to be added to the match. The data of a game does not have any missing values, so they are ready to be saved. As it was said before, this process has to be repeated for each match. 


## The big picture

A few macro analysis of the data in this dataset will allow us to compare it with other datasets and also appraise its balance and general features. The data set, which is available [from GitHub under a free license](https://github.com/analca3/StarCraft-winner-prediction/tree/master/data), includes `r unique.ids` different matches with matches among all different races available in Starcraft. 

The duration of the matches is represented below.

```{r duration,echo=FALSE}
ggplot(duration,aes(x=maxFrames/24/60)) + 
  stat_bin(breaks=c(0,5,10,15,20,25,30,35,40,45,50,55,60,75,90,120,150,180),color="white") +
  stat_bin(breaks=c(0,5,10,15,20,25,30,35,40,45,50,55,60,75,90,120,150,180), geom="text", aes(label=..count..), vjust=-1.5) +
  scale_y_continuous(limits = c(0,1000)) + 
  scale_x_continuous(breaks = c(seq(from=0,to=60,by=5),75,90,120,150,180),minor_breaks = seq(from=0,to=165,by=5)) +
  xlab("Duration (minutes)") + ylab("Number of games") +
  theme_bw()
```


The mode is between 20 and 25 minutes, with average equal to `r mean(duration$maxFrames/24/60)` and median `r mean(duration$maxFrames/24/60)`; they are not too different which indicates that the distribution is not too skewed. 

```{r wl,echo=FALSE}
load("scripts/datos/match.Rdata")
ggplot(aes(x=WL),data=match.data)+geom_bar()+theme_tufte()+xlab("Winner.Loser")+ theme(axis.text.x = element_text(angle = 90, hjust = 1))
```
The combination Terran vs. Protoss includes the bigger number of
matches, with close to two thousands, of which the majority are won by
the Terran. But in fact, the race that wins the most games it is involved it are the Zerg, as is shown below. 

```{r wins, echo=FALSE}
wins.data <- data.frame(Race=character(),
                        Wins=integer(),
                        Loses=integer(),
                        Ratio=double())

for (i in c("Protoss","Zerg","Terran")) {
    race.data <- match.data[ match.data$Winner == i |  match.data$Loser == i,]
    wins.data <- rbind( wins.data,
                       data.frame( Race=i,
                                  Wins=length(race.data[match.data$Winner == i,]$id),
                                  Loses=length(race.data[match.data$Loser == i,]$id) )
                       )
}
wins.data$Ratio = wins.data$Wins/(  wins.data$Wins + wins.data$Loses)
ggplot(aes(x=reorder(Race,-Ratio),y=Ratio),data=wins.data)+geom_bar(stat="identity")+theme_tufte()+xlab("Winner matches per race")
```

Both macro measures show that the data set is well balanced, and includes battles in all possible combinations. On the other hand, we can use match data to find out if the duration of the match is affected by the combination of races

```{r match.duration,echo=FALSE}
 ggplot(aes(x=WL,y=Duration),data=match.data)+geom_boxplot()+theme_tufte()+xlab("Winner.Loser")+ theme(axis.text.x = element_text(angle = 90, hjust = 1))
```

And the chart shows that, except for the matches that include Zerg against Zerg and Protoss vs. Protoss, duration is quite similar and is more affected by the game dynamics that by differences or similarities between races. 

## Conclusions

Our intention with the release of RedDwarfData and this report was to introduce a new lightweight dataset that includes matches by all races and that can be used for drawing conclusions about Starcraft matches, as well as use them to actually train bots that play Starcraft. An initial exploratory analysis draws the conclusion that the Zerg might be the best race to use in a game bot; however, this macro conclusion will have to be qualified by an actual analysis of the strategies generally used by Zergs in the particular games included in the data set.
RedDwarfData and the scripts that are used to process it are hosted [in Github: `https://github.com/analca3/StarCraft-winner-prediction`](https://github.com/analca3/StarCraft-winner-prediction). This paper is hosted also in GitHub, and includes the scripts used to generate aggregated data and the data that has been used to generate this paper, which is written in R Markdown and includes the scripts to generate these charts. It has a Apache license and can be downloaded [from GitHub at `https://github.com/geneura-papers/2017-StarCraft-Data`](https://github.com/geneura-papers/2017-StarCraft-Data). 
This dataset has been used already in [@ijcci17starcraft] to make early predictions of the outcome of the match, in order to shorten the time needed to evaluate bots in an optimization environment. Our intention is to create surrogate models of matches so that we can use evolutionary algorithms to create game bots that can be competitive in this game. 

## Acknowledgements

This paper has been funded in part by the Spanish national excellence project TIN2014-56494-C4-3-P (EphemeCH). We acknowledge also the support of TIN2017-85727-C4-2-P (DeepBio).

## References
back to top