teach yourself datamining in 21 days
July 5, 2010
Things may look quiet here but behind the scenes I’ve been working on my project to get up to speed with modern datamining algorithms. The first step has been to assemble some sources of information and some tools for the job.
For a textbook, I’ve chosen Introduction To Data Mining by Tan, Steinbach and Kumar. It provides a good overview of the key algorithms, along with important issues like data quality and consistency. It also introduces the maths in a reasonably gentle way.
Fortunately, while it is important to understand how the algorithms work, it is not necessary to work the maths by hand. There are some first-class freeware datamining programs available that do all the heavy lifting, so long as you know how to prepare the data and how to set the parameters of the algorithms so they produce valid results.
Three datamining packages in particular are worth noting: Weka, RapidMiner and R.
Weka and RapidMiner are GUI-driven toolsets, whereas R is more command-line oriented. You can download all of these and play around with them at home. They’re not toys, so you need to have some confidence about plowing through the user guides and technical manuals, but they are easy enough to get up and running.
The choice between Weka and RapidMiner is a difficult one. At the moment I’m working with Weka but that is mainly because it was the first one I started experimenting with.
The other crucial thing to have at hand is some test data. Of course you may want to remind me that I have several GB of WoW-related data right here. But that is not the place to start. The first step is to learn how the tools work and how to use them. For that we need data where we know the expected results – so that when the tool doesn’t produce the right result we know to look again at how we have applied the algorithm.
The data has to be challenging enough to put the algorithms to a test, but not so challenging that we are left wondering whether the wrong answer is due to the complexity of the data or just due to some dumb mistake.
Over at the Expressive Intelligence Studio blog, Chris Lewis posted an interesting report about using a toon’s gear to predict its class. That’s exactly the sort of place we want to start since all sorts of classifying and clustering algorithms could be tested on a data set like that. The other idea that occurred to me is to use talent builds to predict the spec of the toon. Of course you can do that in a very simple way by just adding up the points spent in each of the three trees: a paladin that has most points in the holy tree is a holy paladin.
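The “just add up the points” approach can be sketched in a few lines. This is a hypothetical illustration, not anything from the real armory data – the tree names are the paladin trees, but the point totals are made up:

```python
# Naive spec classifier: a toon's spec is simply its dominant talent tree.
# Builds are (tree1, tree2, tree3) point totals; values here are invented.

def dominant_spec(points, trees):
    """Return the name of the tree with the most talent points."""
    best = max(range(len(points)), key=lambda i: points[i])
    return trees[best]

paladin_trees = ["holy", "protection", "retribution"]

build = [51, 5, 15]  # a 51/5/15 build: mostly holy
print(dominant_spec(build, paladin_trees))  # holy
```

That rule works fine for a 51/5/15 paladin; the interesting cases are the builds where no single tree dominates.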
Where the problem becomes interesting is with those classes where there is a tendency to spread talent points across more than one tree. I’m thinking mainly of mages and warlocks, but any class where the three trees don’t map straight onto the tank/healz/dps holy trinity should see some points spread across multiple trees. Can datamining algorithms handle “fuzzy” data like that?
To make this discussion more concrete, let’s have a quick look at that very question. We can fire up Weka and feed in a sample of level 80 paladin talent builds. To keep it simple, I’m using a toy data set of only 150 paladins, with 50 from each of the 3 trees. We can run a basic k-means clustering algorithm over the data, which we hope will produce 3 clusters: one each for the holy, protection and retribution trees. And voilà…
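If you’re curious what k-means is actually doing under the hood, here is a bare-bones sketch of Lloyd’s algorithm in Python. This is not the Weka run itself – Weka’s SimpleKMeans does the equivalent with smarter initialisation – and the nine builds and seed centroids below are invented for illustration:

```python
# Minimal k-means (Lloyd's algorithm) on made-up paladin builds.
# Each build is a (holy, protection, retribution) talent point total.

def kmeans(points, centroids, iters=10):
    clusters = [[] for _ in centroids]
    for _ in range(iters):
        # Assignment step: put each build with its nearest centroid
        clusters = [[] for _ in centroids]
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        # Update step: recompute each centroid as its cluster's mean
        centroids = [
            tuple(sum(x) / len(cl) for x in zip(*cl)) if cl else c
            for cl, c in zip(clusters, centroids)
        ]
    return centroids, clusters

builds = [(51, 5, 15), (53, 3, 15), (50, 8, 13),   # holy-heavy
          (5, 53, 13), (8, 51, 12), (3, 55, 13),   # protection-heavy
          (5, 11, 55), (7, 10, 54), (3, 13, 55)]   # retribution-heavy
seeds = [(51, 5, 15), (5, 53, 13), (5, 11, 55)]    # one seed per tree

cents, clusters = kmeans(builds, seeds)
print([len(c) for c in clusters])  # [3, 3, 3]: one clean cluster per tree
```

With builds this cleanly separated, the algorithm converges immediately to one cluster per tree – which is exactly what the Weka run above shows for paladins.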
That works because holy paladins don’t spend many points in protection or retribution talents. But for mages, where there is a significant tendency to spend points in more than one tree, we get this:
Now the algorithm is flummoxed – putting arcane and frost mages in the same cluster and splitting fire mages into two clusters. So we have a simple data set that is also challenging enough to put these tools to a bit of a test.
I’ve also made a third data set using priest builds. Priests have more talent points invested across the trees than paladins but fewer than mages. Clustering this data set is left as an exercise for the reader…
No, seriously… Anybody who’d like to experiment with these data sets can download them from the links here. They’re in a standard .arff format (really just an annotated csv file) that Weka and RapidMiner both know how to load. Note that I’ve used a “.pdf” extension since WordPress will not allow me to upload arbitrary file types. But if you open them in a text editor you’ll see they are just csv data. Rename the extension to “.arff” and they’re ready to go.
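For anyone who hasn’t met the format before, an .arff file really is just csv with a small declarative header on top. A minimal sketch of what you’ll see when you open one in a text editor (the attribute names here are illustrative, not necessarily the exact ones in my files):

```
@relation paladin_builds

@attribute holy        numeric
@attribute protection  numeric
@attribute retribution numeric

@data
51,5,15
5,53,13
5,11,55
```

The `@attribute` lines tell Weka and RapidMiner the column names and types; everything after `@data` is plain comma-separated rows.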
I’ll have a lot more to say about these little data sets in the next few posts.