teach yourself datamining in 21 days

July 5, 2010

Things may look quiet here but behind the scenes I’ve been working on my project to get up to speed with modern datamining algorithms. The first step has been to assemble some sources of information and some tools for the job.

For a textbook, I’ve chosen Introduction To Data Mining by Tan, Steinbach and Kumar. It provides a good overview of the key algorithms, along with important issues like data quality and consistency. It also introduces the maths in a reasonably gentle way.

Fortunately, while it is important to understand how the algorithms work, it is not necessary to work the maths by hand. There are some first-class freeware datamining programs available that do all the heavy lifting, so long as you know how to prepare the data and how to set the parameters of the algorithms so they produce valid results.

Three datamining packages in particular are worth noting: Weka, RapidMiner and R.

Weka and RapidMiner are GUI-driven toolsets, whereas R is more command-line oriented. You can download all of these and play around with them at home. They’re not toys, so you need to have some confidence about plowing through the user guides and technical manuals, but they are easy enough to get up and running.

The choice between Weka and RapidMiner is a difficult one. At the moment I’m working with Weka but that is mainly because it was the first one I started experimenting with.

The other crucial thing to have at hand is some test data. Of course you may want to remind me that I have several GB of WoW-related data right here. But that is not the place to start. The first step is to learn how the tools work and how to use them. For that we need data where we know the expected results – so that when the tool doesn’t produce the right result we know to look again at how we have applied the algorithm.

The data has to be challenging enough to put the algorithms to a test, but not so challenging that we are left wondering whether the wrong answer is due to the complexity of the data or just due to some dumb mistake.

Over at the Expressive Intelligence Studio blog, Chris Lewis posted an interesting report about using a toon’s gear to predict its class. That’s exactly the sort of place we want to start since all sorts of classifying and clustering algorithms could be tested on a data set like that. The other idea that occurred to me is to use talent builds to predict the spec of the toon. Of course you can do that in a very simple way by just adding up the points spent in each of the three trees: a paladin that has most points in the holy tree is a holy paladin.
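The "just add up the points" approach can be sketched in a few lines of Python. This is only an illustration: the function name and the sample build are made up for the example, and ties are broken arbitrarily.

```python
def spec_from_points(points_by_tree):
    """Return the name of the tree with the most talent points in it."""
    # max() with a key picks the tree whose point total is largest.
    # On a tie it returns whichever tree comes first in the dict.
    return max(points_by_tree, key=points_by_tree.get)

# Hypothetical paladin build, invented for this example.
build = {"holy": 51, "protection": 5, "retribution": 15}
print(spec_from_points(build))
```

For a build like this one the answer is obvious, which is exactly why it makes a handy baseline to compare the fancier algorithms against.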

Where the problem becomes interesting is with those classes where there is a tendency to spread talent points across more than one tree. I’m thinking mainly of mages and warlocks but any class where the three trees don’t map straight onto the tank/healz/dps holy trinity should see some points spread across multiple trees. Can datamining algorithms handle “fuzzy” data like that?

To make this discussion more concrete, let’s have a quick look at that very question. We can fire up Weka and feed in a sample of level 80 paladin talent builds. To keep it simple, I’m using a toy data set of only 150 paladins with 50 from each of the 3 trees. We can run a basic k-means clustering algorithm over the data, which we hope should produce 3 clusters: one each for the holy, protection and retribution trees. And voilà…

[Figure: Paladins clustered]
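For anyone curious about what happens behind the GUI, here is a rough sketch of the basic k-means loop in plain Python. It is not Weka's implementation, just the textbook assign-and-update cycle. The nine builds are invented (points in holy, protection, retribution), and the starting centroids are picked by hand so the run is repeatable; a real tool would normally choose them at random.

```python
import math

def kmeans(points, centroids, max_iter=100):
    """Basic k-means. `centroids` holds the starting centre of each cluster."""
    centroids = [list(c) for c in centroids]
    assignment = None
    for _ in range(max_iter):
        # Assignment step: each build goes to its nearest centroid.
        new_assignment = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # nothing moved, so the clusters are stable
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for j in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # an empty cluster keeps its old centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids

# Hypothetical builds: (holy, protection, retribution) points.
builds = [
    (51, 5, 15), (53, 18, 0), (56, 0, 15),   # mostly holy
    (0, 53, 18), (5, 56, 10), (0, 51, 20),   # mostly protection
    (11, 5, 55), (0, 18, 53), (18, 0, 53),   # mostly retribution
]

# Assumption: seed one centroid from each group to keep the example deterministic.
clusters, centres = kmeans(builds, [builds[0], builds[3], builds[6]])
print(clusters)
```

On cleanly separated builds like these, each cluster ends up holding the builds from exactly one tree.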

That works because holy paladins don’t spend many points in protection or retribution talents. But for mages, where there is a significant tendency to spend points in more than one tree we get this:

[Figure: Mages clustered]

Now the algorithm is flummoxed – putting arcane and frost mages in the same cluster and splitting fire mages into two clusters. So we have a simple data set that is also challenging enough to put these tools to a bit of a test.
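One way to see this kind of failure without squinting at a chart is to cross-tabulate the known spec against the cluster number. The sketch below uses invented labels and cluster numbers that merely mimic the pattern described above; they are not the actual output from the mage data set.

```python
from collections import Counter

def crosstab(known_specs, clusters):
    """Count how many toons of each known spec landed in each cluster."""
    return Counter(zip(known_specs, clusters))

# Hypothetical labels and cluster numbers, made up to mimic the mage result.
specs    = ["arcane", "arcane", "frost", "frost", "fire", "fire", "fire", "fire"]
clusters = [0, 0, 0, 0, 1, 1, 2, 2]

table = crosstab(specs, clusters)
for (spec, cluster), n in sorted(table.items()):
    print(f"{spec:>7}  cluster {cluster}: {n}")
```

A clean result has each spec sitting in a cluster of its own. Here arcane and frost share cluster 0 while fire is split across clusters 1 and 2.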

I’ve also made a third data set using priest builds. Priests have more talent points invested across the trees than paladins but fewer than mages. Clustering this data set is left as an exercise for the reader…

No, seriously… Anybody who’d like to experiment with these data sets can download them from the links here. They’re in a standard .arff format (really just an annotated csv file) that Weka and RapidMiner both know how to load. Note that I’ve used a “.pdf” extension since WordPress will not allow me to upload arbitrary file types. But if you open them in a text editor you’ll see they are just csv data. Rename the extension to “.arff” and they’re ready to go.
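To show what "annotated csv" means, here is a small sketch that reads a simple ARFF file in Python. The sample content and attribute names are guesses for illustration only, so the real files may be laid out differently. The parser is deliberately minimal: it ignores quoted names, sparse rows and other corners of the format that Weka itself handles.

```python
def read_arff(text):
    """Parse a simple ARFF document into (attribute names, data rows)."""
    attributes, rows, in_data = [], [], False
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and % comments
        lower = line.lower()
        if in_data:
            # After @data, every line is an ordinary comma-separated row.
            rows.append([field.strip() for field in line.split(",")])
        elif lower.startswith("@attribute"):
            # "@attribute name type" -> keep just the name
            attributes.append(line.split()[1])
        elif lower.startswith("@data"):
            in_data = True
    return attributes, rows

sample = """\
% Hypothetical layout; the real files may use different attribute names.
@relation paladins
@attribute holy numeric
@attribute protection numeric
@attribute retribution numeric
@attribute spec {holy,protection,retribution}
@data
51,5,15,holy
0,53,18,protection
"""

names, rows = read_arff(sample)
print(names)
print(rows[0])
```

Everything above the `@data` line is the annotation; everything below it is plain csv.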


I’ll have a lot more to say about these little data sets in the next few posts.


8 Responses to “teach yourself datamining in 21 days”

  1. Roben Says:

    I use the Analysis Studio software. It is an easy-to-use data mining package with 6 regression types (it has an automatic procedure for finding the best regression for a given data set) and other data mining procedures such as logistic regression, survival analysis and time series analysis.

  2. zardoz Says:

    Thanks, I’ll check that one out as well. Each package does slightly different things which is interesting in itself.

  3. […] teach yourself datamining in 21 days « Armory Data Mining Learning datamining, using the WoW Armory as a data set. (tags: datamining mmo worldofwarcraft statistics ) […]

  4. Hi,
    recently I co-authored a study in which we tested and compared software solutions for data mining. One of the solutions that I examined was KNIME, an open source data mining suite developed at the University of Konstanz, Germany. KNIME is based on the Eclipse IDE and comes with a nifty GUI which is fun to work with. There are plenty of algorithms implemented and if you are fluent in Java you can add your own. I had some stability issues, especially when working with large data sets, but KNIME is surely worth trying out.

  5. Glenn Wright Says:

    How’s the project going, to teach yourself data mining? I was considering doing something similar, perhaps with R since I’m already somewhat familiar with that. Fun to do it with WoW 🙂

  6. zardoz Says:

    Slowly alas… I’m just too distracted with other things at the moment. But WoW data is pretty good for datamining experiments since it is “artificial” and contains less variability (despite appearances…) than large data sets from real life. If you’d like some sample data to play with, just let me know.

  7. Glenn Wright Says:

    I take it that you get data on gear and talents, basically?

  8. zardoz Says:

    Plus stats and a few other things like battleground and raid performance, yes. Most of what you see on your character pages in the armoury can be downloaded as XML and processed. I shred the XML and put everything into a relational database, but others play with the XML directly.
