back to the future

February 2, 2011

Just a brief note to say that I haven’t abandoned all hope of getting this site going again. Even though I’m not playing MMOs at the moment, it seems a shame to just let everything sit idle. My basic infrastructure runs without too much effort, so it is no great problem to refresh the data every couple of months.

The main obstacle is that Blizz is now serving the up-to-date data from battle.net in HTML format rather than XML. My page-scraping code needs to change to cope with that. Fortunately, however, the Blizz engineers are serving up valid XHTML, which means that XPath expressions can still be used to extract the data we need.

If I’ve been a good little engineer then only my XPaths need to change and nothing else…

There is a danger that the XPath expressions can become more than a bit baroque because they have to navigate through all the HTML markup to get to the data nodes, although there are tricks to get the XPath engine to do a lot of the searching.
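To give a feel for the "let the engine do the searching" trick, here is a minimal sketch in Python. The markup, element ids and class names are all made up for illustration (the real battle.net pages will look different), and the standard library's ElementTree only supports a subset of XPath. The point is that a `.//` descendant search plus an attribute predicate finds the data nodes without spelling out every wrapper div along the way.

```python
import xml.etree.ElementTree as ET

# Hypothetical XHTML fragment: the structure, ids and class names are
# assumptions for illustration, not the real battle.net markup.
PAGE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <div id="wrapper">
      <div class="profile">
        <ul>
          <li class="stat" id="health"><span class="value">31250</span></li>
          <li class="stat" id="mana"><span class="value">8120</span></li>
        </ul>
      </div>
    </div>
  </body>
</html>"""

# XHTML elements live in this namespace, so the paths need a prefix for it.
NS = {"x": "http://www.w3.org/1999/xhtml"}


def extract_stats(xhtml):
    """Return {stat id: integer value} for every <li class="stat"> node."""
    root = ET.fromstring(xhtml)
    stats = {}
    # ".//" searches all descendants, so html/body/div/div/ul is never named.
    for li in root.findall(".//x:li[@class='stat']", NS):
        value = li.find("x:span[@class='value']", NS)
        if value is not None and value.text:
            stats[li.get("id")] = int(value.text)
    return stats


stats = extract_stats(PAGE)
print(stats)
```

If the page layout changes but the data nodes keep their attributes, a path like this keeps working, which is about as much as a page scraper can hope for.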

Anybody looking for inspiration on how to parse the battle.net XHTML should check out these posts by a geek blogger called Kastang. That’s the method I’ll be using when I get back to all this.

the talented Mr Druid

July 28, 2010

Now that we’ve got a dataset which can give us feral druids who:

  • are consistently geared and spec-ed and
  • are serious participants in instance-running and raiding

then we can move to the next stage: trying to find ways to partition that set into bears and cats.

This is where the visualization tools in a datamining package really come into their own. We can add a third dimension to any cluster by using colour. And we can quickly iterate through all the data dimensions to see which ones produce the best clusters. In these charts I’m filtering out all feral druids with spellpower gear and all who have run fewer than 75 instances or raids.
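For anyone following along at home, the filter amounts to something like this rough Python sketch. The field names and the sample toons are invented, and the spellpower cut-off of zero is an assumption on my part; only the 75-run threshold comes from the text above.

```python
# Hypothetical sample records: names, fields and values are made up.
toons = [
    {"name": "A", "spellpower": 0, "runs": 120},
    {"name": "B", "spellpower": 1800, "runs": 200},
    {"name": "C", "spellpower": 0, "runs": 40},
]

MAX_SPELLPOWER = 0  # assumed cut-off for "has spellpower gear"
MIN_RUNS = 75       # instances or raids run, as described in the post


def keep(toon):
    """True for toons with no spellpower gear and at least MIN_RUNS runs."""
    return toon["spellpower"] <= MAX_SPELLPOWER and toon["runs"] >= MIN_RUNS


kept = [t["name"] for t in toons if keep(t)]
print(kept)
```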

I’m still plotting health vs mana, to keep the charts consistent across posts, but we’re getting close to the point where we will have to find different stats to graph. We know now that mana is irrelevant and health only a partial indicator of tank-ness. But for the time being, the main cluster that results from that plot is good enough.

Now we want to know which talents can partition the cluster. (And we could ask the same question of glyphs or character stats too.) How about Primal Gore? This is the result – not a lot of partitioning going on there:

Primal Gore - not effective in clustering

Thanks to various comments, it’s clear that there is a set of talents which people expect to effectively partition the cluster. Popular suggestions have included: Thick Hide, Natural Reaction and Protector of the Pack for bears and Shredding Attacks, Predatory Instincts, King of the Jungle, Survival Instincts and Natural Shapeshifter for cats.

Now we can look at each of those in detail:

Bear:

Protector of the Pack cluster

Natural Reaction cluster

Thick Hide cluster

Cat:

Survival Instincts Cluster

Shredding Attacks cluster

Predatory Instincts cluster

Natural Shapeshifter cluster

King of the Jungle cluster

You can see that some talents appear to be better than others at defining two distinct clusters. They all have a bit of partitioning effect, but some are better than others at producing the largest “distance” between the two clusters. Predatory Instincts produces clear gold and light blue clusters but Natural Shapeshifter produces more of a greenish middle ground which means that many players in both camps have put a point or two into it.

Datamining clustering algorithms work by calculating “distances” between data points along each of the data dimensions then aggregating those distance measures across all the dimensions. For example, the distance between a toon which has 3 points in Thick Hide and a toon which has zero points in the talent could be measured as “3” and then a sum of all distances could produce a measure of how distinct one toon is from another (although the algorithms generally use more sophisticated maths than just that.)
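As a rough sketch of that idea in Python: the two toons and their talent points below are invented, and the two distance functions are the standard textbook ones (sum of absolute differences, and straight-line Euclidean distance), not necessarily what any particular datamining package uses by default.

```python
import math

# Hypothetical toons: points spent in three feral talents.
bear = {"Thick Hide": 3, "Natural Reaction": 3, "Predatory Instincts": 0}
cat = {"Thick Hide": 0, "Natural Reaction": 0, "Predatory Instincts": 3}


def manhattan(a, b):
    """Sum of the per-talent differences (the simple "3 + 3 + 3" measure)."""
    return sum(abs(a[k] - b[k]) for k in a)


def euclidean(a, b):
    """Straight-line distance across all talent dimensions."""
    return math.sqrt(sum((a[k] - b[k]) ** 2 for k in a))


print(manhattan(bear, cat))  # 9
print(euclidean(bear, cat))  # square root of 27, a bit under 5.2
```

The larger either number is, the more distinct the two toons look to the algorithm.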

So we want the talents with the greatest distance between the two clusters. You can have a look at the charts and see which ones you think are the best ones. I’ll put up my numbers on that in the next post.

Now if we use the better of those talent dimensions as inputs to our clustering algorithm we get this:

Clusters in five talent dimensions.

The crucial thing here is that the blue cluster, which contains the toons with bear-ish talents, extends right along the health x-axis. No doubt serious tanks are picking gear, gems and enchants that boost health. But since we are looking for a count of all tanks, all the way from those running 5-toon instances to those in the endgame raids, we should expect that there will be a wide spread of health between those just starting out and those nearer the end of the raiding dungeon chain.

That’s one reason why I’m about to abandon the health vs mana thing and move onto other character stats. More about that in the next post. But the reason we can make decisions like that is due to the insights that this data visualization gives us.

the truth is in there

July 20, 2010

Thanks to all the correspondents who commented on my feral druids datamining experiment. I’m happy that I’ve got a reasonable estimate for the number of bear tanks now. But I’m holding back from putting up my final word on the subject since I’m trying to encourage a couple of people to write up their own analysis first.

You may recall we left off with a simple graph of health vs mana for level 80 feral druids that produced two very distinct clusters – a red and a blue one – sorta like one of those political maps of the USA except with all the republicans and democrats clumped together in separate parts of the country.

And the key question was… um… While we’re on that subject… Can anybody explain to me why those political maps always colour the conservatives red and the liberals blue? It’s very confusing to a foreigner since just about everywhere else in the world, red is associated with the left or progressive side and blue with the Tory or conservative side.

Remember that great movie from the Reaganite ’80s? It was Red Dawn, not Blue Dawn. But I digress…

And the key question was: what were those red ferals doing at the high mana end of the scale? I had my doubts that there could be so many toons carrying mismatched specs and gear. But I’ve been convinced that, yes, there is something not quite right there. A simple filter that drops those toons in the sample with significant spellpower gear basically makes the red cluster disappear.

Now that might not sound like progress – ending up with one cluster – but don’t forget that the power of these datamining algorithms is that they cluster in multiple data “dimensions”. To the eye, there is one cluster, because we are drawing the graph in two “dimensions”: health and mana.

But as soon as we add some talent and glyph dimensions, the big blue blob starts to break up into separate clusters. And this time, there is a good match between the talents and glyphs that we expect to distinguish cats from bears and the actual location of each cluster in the multi-dimensional space.

But it’s a whole lot easier to show you than to tell you, so I’ll leave you with a simple illustration of how that all works. We can add a third dimension to the graph by using colour. The datamining packages that I’m playing with are very good at that sort of visualization, as you can see here.

With the spellpower toons gone, the high mana group has also mostly gone and the shape of the blue cluster has become clearer as the graph scale has changed. Then we overlay, say, a cat glyph:

Feral Druids with Glyph of Shred

and a bear glyph:

Feral Druids with Glyph of Maul

and the clusters within the cluster become pretty clear. Thanks again to Narkondas for the key clues that inspired those graphs.

The datamining algorithms will generate a count of the toons in each cluster, but I’ll leave that till the next post. As you can imagine, with a big clump of ferals filtered out, the percentage of bear tanks in the overall mix is getting smaller.

I should also say that I’m about to collect a new data set and update my armoury reports since the data is getting a bit old and stale. As usual that will take a week or so.

Things may look quiet here but behind the scenes I’ve been working on my project to get up to speed with modern datamining algorithms. The first step has been to assemble some sources of information and some tools for the job.

For a textbook, I’ve chosen Introduction To Data Mining by Tan, Steinbach and Kumar. It provides a good overview of the key algorithms, along with important issues like data quality and consistency. It also introduces the maths in a reasonably gentle way.

Fortunately, while it is important to understand how the algorithms work, it is not necessary to work the maths by hand. There are some first class freeware datamining programs available that do all the heavy lifting, so long as you know how to prepare the data and how to set the parameters of the algorithms so they produce valid results.

Three datamining packages in particular are worth noting:

  • Weka
  • RapidMiner
  • R

Weka and RapidMiner are GUI-driven toolsets whereas R is more command-line oriented. You can download all of these and play around with them at home. They’re not toys, so you need to have some confidence about plowing through the user guides and technical manuals, but they are easy enough to get up and running.

The choice between Weka and RapidMiner is a difficult one. At the moment I’m working with Weka but that is mainly because it was the first one I started experimenting with.

The other crucial thing to have at hand is some test data. Of course you may want to remind me that I have several GB of WoW-related data right here. But that is not the place to start. The first step is to learn how the tools work and how to use them. For that we need data where we know the expected results – so that when the tool doesn’t produce the right result we know to look again at how we have applied the algorithm.

The data has to be challenging enough to put the algorithms to a test, but not so challenging that we are left wondering whether the wrong answer is due to the complexity of the data or just due to some dumb mistake.

Over at the Expressive Intelligence Studio blog, Chris Lewis posted an interesting report about using a toon’s gear to predict its class. That’s exactly the sort of place we want to start since all sorts of classifying and clustering algorithms could be tested on a data set like that. The other idea that occurred to me is to use talent builds to predict the spec of the toon. Of course you can do that in a very simple way by just adding up the points spent in each of the three trees: a paladin that has most points in the holy tree is a holy paladin.
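The simple version fits in a few lines of Python. The build below is a made-up example (71 points, as for a level 80 toon), and the function name is just for illustration:

```python
def spec_by_points(points_per_tree):
    """Return the name of the tree with the most talent points."""
    return max(points_per_tree, key=points_per_tree.get)


# Hypothetical level 80 paladin build: 51 + 5 + 15 = 71 points.
build = {"holy": 51, "protection": 5, "retribution": 15}
spec = spec_by_points(build)
print(spec)  # holy
```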

Where the problem becomes interesting is with those classes where there is a tendency to spread talent points across more than one tree. I’m thinking mainly of mages and warlocks but any class where the three trees don’t map straight onto the tank/healz/dps holy trinity should see some points spread across multiple trees. Can datamining algorithms handle “fuzzy” data like that?

To make this discussion more concrete, let’s have a quick look at that very question. We can fire up Weka and feed in a sample of level 80 paladin talent builds. To keep it simple, I’m using a toy data set of only 150 paladins with 50 from each of the 3 trees. We can run a basic k-means clustering algorithm over the data, which we hope should produce 3 clusters: one each for the holy, protection and retribution trees. And voilà…

Paladins clustered

That works because holy paladins don’t spend many points in protection or retribution talents. But for mages, where there is a significant tendency to spend points in more than one tree we get this:

Mages clustered.

Now the algorithm is flummoxed – putting arcane and frost mages in the same cluster and splitting fire mages into two clusters. So we have a simple data set that is also challenging enough to put these tools to a bit of a test.
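If you’re curious what k-means is actually doing, here is a bare-bones Python sketch of the textbook algorithm. It is not Weka’s implementation, the eight builds are invented (points in holy, protection and retribution), and I’m hand-picking the starting centroids to keep the run deterministic, where a real package would normally choose them randomly.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, initial_centroids, max_iterations=100):
    """Plain k-means: assign each point to its nearest centroid, then move
    each centroid to the mean of its members, until assignments settle."""
    centroids = list(initial_centroids)
    labels = None
    for _ in range(max_iterations):
        new_labels = [
            min(range(len(centroids)), key=lambda c: sq_dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # nothing moved, so we have converged
        labels = new_labels
        for c in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # leave an empty cluster's centroid where it is
                centroids[c] = tuple(
                    sum(col) / len(members) for col in zip(*members)
                )
    return labels, centroids


# Hypothetical builds: (holy, protection, retribution), 71 points each.
builds = [
    (51, 5, 15), (53, 0, 18), (50, 11, 10),   # holy-ish
    (0, 53, 18), (5, 51, 15),                 # protection-ish
    (0, 18, 53), (11, 5, 55), (2, 16, 53),    # retribution-ish
]

# Assumed starting centroids: one hand-picked build per expected cluster.
labels, centroids = kmeans(builds, [builds[0], builds[3], builds[5]])
print(labels)  # [0, 0, 0, 1, 1, 2, 2, 2]
```

With builds this lopsided the clusters fall out cleanly. Spread the points more evenly across the trees and the nearest-centroid step has far less to go on.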

I’ve also made a third data set using priest builds. Priests have more talent points invested across the trees than paladins but fewer than mages. Clustering this data set is left as an exercise for the reader…

No, seriously… Anybody who’d like to experiment with these data sets can download them from the links here. They’re in a standard .arff format (really just an annotated csv file) that Weka and RapidMiner both know how to load. Note that I’ve used a “.pdf” extension since WordPress will not allow me to upload arbitrary file types. But if you open them in a text editor you’ll see they are just csv data. Rename the extension to “.arff” and they’re ready to go.

ClassByGear.arff
PaladinBuilds.arff
MageBuilds.arff
PriestBuilds.arff
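To show what “annotated csv” means, here is a small Python sketch that reads a simple ARFF file by hand. The sample text is invented (it is not the contents of the files above, and the attribute names are assumptions), and the reader only copes with plain unquoted attribute names. Weka and RapidMiner will of course load the real thing without any of this.

```python
import csv

# Hypothetical sample in ARFF layout: a header of @attribute lines,
# then @data followed by ordinary comma-separated rows.
SAMPLE = """% made-up sample, not one of the real data sets
@relation paladin_builds
@attribute holy numeric
@attribute protection numeric
@attribute retribution numeric
@attribute spec {holy,protection,retribution}
@data
51,5,15,holy
0,53,18,protection
"""


def read_arff(text):
    """Return (attribute names, data rows) from simple ARFF text."""
    names, rows, in_data = [], [], False
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and % comments
        if in_data:
            rows.append(next(csv.reader([line])))
        elif line.lower().startswith("@attribute"):
            names.append(line.split()[1])
        elif line.lower().startswith("@data"):
            in_data = True
    return names, rows


names, rows = read_arff(SAMPLE)
print(names)
print(rows)
```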

I’ll have a lot more to say about these little data sets in the next few posts.

So I’m working away here on my little teach yourself datamining in 21 days project. One thing that is distracting me is that I’m becoming aware of the possibilities of datamining not just games but RL™ itself.

The rise of the social web is making an enormous amount of data about the average person in the street available to be mined. We seem to be at some tipping point just now, where serious work is starting to be done on this (leaving aside the question of whatever serious datamining work has been done by various shadowy arms of government).

And, for better or worse, the data could be a lot more revealing than the sort of insights your supermarket might be collecting about you. They just pay people to figure out whether you’re more likely to buy beer or soft drink with your instant noodles. The stuff that’s potentially available from Twitter, Facebook and their like, and from the various tentacles of the googlepus, makes that seem like small change indeed.

One of my favourite sites at the moment is floatingsheep. Highly worth a look if you’re a dataphile at heart.

You can see from that site that concerns about privacy can be greatly exaggerated. Datamining is more interested in social trends than what the individual might be up to. The same principle applies here: my reports deal with averages – what 100K gnomes do with their time, rather than what any single gnome might be doing – but it would be easy to drill down much lower if I felt it was worth it. If I collected the achievements data and tracked specific guilds or toons to-ing and fro-ing across Azeroth, the analogy would be even clearer.

There must be a point where people beyond the alfoil hat brigade start to get a bit upset by some of the implications of social datamining. Don’t forget that Blizz had to remove all the monetary data from the armoury statistics – and there’s no sign of that data coming back – and that’s just about a game. We saw something of what might happen in the fuss over Google Buzz. How long before we see more general concerns surface out there in Web-2.0-land?