back to the future

February 2, 2011

Just a brief note to say that I haven’t abandoned all hope of getting this site going again. Even though I’m not playing MMOs at the moment, it seems a shame to just let everything sit idle. My basic infrastructure runs without too much effort, so it is no great problem to refresh the data every couple of months.

The main obstacle is that Blizz is now serving the up-to-date data in HTML format rather than XML. My page-scraping code needs to change to cope with that. Fortunately, however, the Blizz engineers are serving up valid XHTML, which means that XPath expressions can still be used to extract the data we need.

If I’ve been a good little engineer then only my XPaths need to change and nothing else…

There is a danger that the XPath paths can become more than a bit baroque because they have to navigate through all the HTML markup to get to the data nodes, although there are tricks to get the XPath engine to do a lot of the searching.
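To make that a bit more concrete, here is a minimal Python sketch using the standard library’s ElementTree, which supports a limited subset of XPath. The XHTML fragment, its class names and the toon name are all invented for illustration; the real armoury markup will differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical XHTML fragment; not the real armoury page structure.
PAGE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <div id="wrapper">
      <div class="profile">
        <span class="name">Sometoon</span>
        <span class="level">80</span>
      </div>
    </div>
  </body>
</html>"""

# XHTML elements live in a namespace, so every step needs a prefix.
NS = {"x": "http://www.w3.org/1999/xhtml"}

root = ET.fromstring(PAGE)

# Baroque: spell out every step from the root down to the data node.
explicit = root.find("x:body/x:div/x:div/x:span[@class='level']", NS)

# Shorter: './/' asks the engine to search all descendants for us.
searched = root.find(".//x:span[@class='level']", NS)

print(explicit.text, searched.text)
```

The second form also survives a change in the wrapper markup, as long as the data node itself keeps its class.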

Anybody looking for inspiration on how to parse the XHTML should check out these posts by a geek blogger called Kastang. That’s the method I’ll be using when I get back to all this.

Things may look quiet here but behind the scenes I’ve been working on my project to get up to speed with modern datamining algorithms. The first step has been to assemble some sources of information and some tools for the job.

For a textbook, I’ve chosen Introduction To Data Mining by Tan, Steinbach and Kumar. It provides a good overview of the key algorithms, along with important issues like data quality and consistency. It also introduces the maths in a reasonably gentle way.

Fortunately, while it is important to understand how the algorithms work, it is not necessary to work the maths by hand. There are some first class freeware datamining programs available that do all the heavy lifting, so long as you know how to prepare the data and how to set the parameters of the algorithms so they produce valid results.

Three datamining packages in particular are worth noting: Weka, RapidMiner and R.

Weka and RapidMiner are GUI-driven toolsets, whereas R is more command-line oriented. You can download all of these and play around with them at home. They’re not toys, so you need to have some confidence about plowing through the user guides and technical manuals, but they are easy enough to get up and running.

The choice between Weka and RapidMiner is a difficult one. At the moment I’m working with Weka but that is mainly because it was the first one I started experimenting with.

The other crucial thing to have at hand is some test data. Of course you may want to remind me that I have several GB of WoW-related data right here. But that is not the place to start. The first step is to learn how the tools work and how to use them. For that we need data where we know the expected results – so that when the tool doesn’t produce the right result we know to look again at how we have applied the algorithm.

The data has to be challenging enough to put the algorithms to a test, but not so challenging that we are left wondering whether the wrong answer is due to the complexity of the data or just due to some dumb mistake.

Over at the Expressive Intelligence Studio blog, Chris Lewis posted an interesting report about using a toon’s gear to predict its class. That’s exactly the sort of place we want to start since all sorts of classifying and clustering algorithms could be tested on a data set like that. The other idea that occurred to me is to use talent builds to predict the spec of the toon. Of course you can do that in a very simple way by just adding up the points spent in each of the three trees: a paladin that has most points in the holy tree is a holy paladin.
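The “just add up the points” rule is about as simple as a classifier gets. A minimal Python sketch, where the function name and the sample build are my own inventions:

```python
def predict_spec(points_by_tree):
    """Return the tree with the most points; ties go to the first tree listed."""
    return max(points_by_tree, key=points_by_tree.get)

# A made-up level 80 paladin build.
build = {"holy": 51, "protection": 5, "retribution": 15}
print(predict_spec(build))
```

It is worth keeping a baseline like this around, since any fancier algorithm ought to at least match it on the easy cases.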

Where the problem becomes interesting is with those classes where there is a tendency to spread talent points across more than one tree. I’m thinking mainly of mages and warlocks but any class where the three trees don’t map straight onto the tank/healz/dps holy trinity should see some points spread across multiple trees. Can datamining algorithms handle “fuzzy” data like that?

To make this discussion more concrete, let’s have a quick look at that very question. We can fire up Weka and feed in a sample of level 80 paladin talent builds. To keep it simple, I’m using a toy data set of only 150 paladins with 50 from each of the 3 trees. We can run a basic k-means clustering algorithm over the data, which we hope should produce 3 clusters: one each for the holy, protection and retribution trees. And voilà…

Paladins clustered

That works because holy paladins don’t spend many points in protection or retribution talents. But for mages, where there is a significant tendency to spend points in more than one tree we get this:

Mages clustered.

Now the algorithm is flummoxed – putting arcane and frost mages in the same cluster and splitting fire mages into two clusters. So we have a simple data set that is also challenging enough to put these tools to a bit of a test.
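For anybody curious about what the tool is doing under the hood, here is a bare-bones k-means sketch in Python. It is not Weka’s implementation: the farthest-point seeding is my own choice to keep the run repeatable, and the six paladin builds are made up rather than taken from the real data set.

```python
import math

def kmeans(points, k, iterations=20):
    """Plain k-means with a deterministic farthest-point start (not Weka's seeding)."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Next seed: the point farthest from its nearest existing seed.
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    labels = []
    for _ in range(iterations):
        # Assignment step: each build joins its nearest centroid.
        labels = [
            min(range(k), key=lambda i: math.dist(p, centroids[i])) for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:
                centroids[i] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels

# Made-up (holy, protection, retribution) point spreads, two per spec.
paladins = [(51, 5, 15), (53, 0, 18), (0, 53, 18), (5, 51, 15), (0, 18, 53), (11, 5, 55)]
print(kmeans(paladins, 3))
```

Builds that pile their points into one tree sit close to a corner of the space and are easy to separate. Builds that split points across two trees sit in between, which is where a distance-based method like this starts to struggle.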

I’ve also made a third data set using priest builds. Priests have more talent points invested across the trees than paladins but fewer than mages. Clustering this data set is left as an exercise for the reader…

No, seriously… Anybody who’d like to experiment with these data sets can download them from the links here. They’re in a standard .arff format (really just an annotated csv file) that Weka and RapidMiner both know how to load. Note that I’ve used a “.pdf” extension since WordPress will not allow me to upload arbitrary file types. But if you open them in a text editor you’ll see they are just csv data. Rename the extension to “.arff” and they’re ready to go.
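If you would rather poke at the files from a script than from a GUI, the format is simple enough to read by hand. Below is a rough Python sketch of a reader for the plain case (a header of @attribute lines followed by comma-separated rows after @data). The attribute names in the sample are hypothetical, not necessarily the ones in my files, and the reader ignores quoted names and sparse rows.

```python
# Hypothetical sample; the real files may use different attribute names.
SAMPLE = """% talent points per tree
@relation paladins
@attribute holy numeric
@attribute protection numeric
@attribute retribution numeric
@attribute spec {holy,protection,retribution}
@data
51,5,15,holy
0,53,18,protection
"""

def read_arff(text):
    """Parse a simple ARFF document into (attribute names, rows of strings)."""
    names, rows, in_data = [], [], False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):   # blank line or ARFF comment
            continue
        lower = line.lower()
        if in_data:
            rows.append([cell.strip() for cell in line.split(",")])
        elif lower.startswith("@attribute"):
            names.append(line.split()[1])      # second token is the name
        elif lower.startswith("@data"):
            in_data = True                     # everything after this is csv
    return names, rows

names, rows = read_arff(SAMPLE)
print(names)
print(rows)
```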


I’ll have a lot more to say about these little data sets in the next few posts.

pet sounds

July 3, 2009

Here’s news for Hunters:  Blizz has added your pets and their talents to your character talent page in the armoury.

That will give us a lot of new insight into the types of creatures that make the most popular pets, and how players spend pet talent points. The XML is straightforward and contains no quirks and so I’m cranking up the SQL editor as we speak.

I’ll try and have the new reports ready before the next refresh of my data. As to when that will be, well, about a week after Patch 3.2 drops. As to when that will be… well…

Over at Wowenomics, Gevlon from the Greedy Goblin left a comment about datamining that is worth replying to. Basically, his point is that sites like this one report on what the average player is doing, but that is not much use because the average player is only making average choices. (…or at least that’s the polite paraphrase of his point.)

In fact the data shows that this sort of argument is not true. Let’s start from what the data actually looks like. Here, I’ve taken an example picked at random: all choices made by 69 DKs for the chest slot:

Item Count
Mightstone Breastplate 2421
Battlemaster’s Breastplate 1010
Scavenged Tirasian Plate 996
Adamantite Breastplate 898
Murkblood Avenger’s Chestplate 655
Gorge’s Breastplate of Bloodrage 309
Battle Leader’s Breastplate 208
Saronite War Plate 197
Fel Iron Breastplate 141
Unscarred Breastplate 136
Westguard Armor 78
Coldrock Breastplate 75
Azure Chain Hauberk 71
Durotan’s Battle Harness 70
Baleheim Armor 68
Light-Touched Breastplate 63
Breastplate of the Warbringer 48
Andrethan’s Masterwork 46
Bone-Threaded Harness 44
Segmented Breastplate 32
Conqueror’s Breastplate 30
Blacksoul Protector’s Hauberk 26
Bloodfist Breastplate 23
Chestguard of Illumination 21
Nether Protector’s Chest 21
Vest of Vengeance 21
Soul Saver’s Chest Plate 21
Scavenged Breastplate 18
Chestguard of Salved Wounds 18
Breastplate of Blade Turning 18
Light-Bound Chestguard 17
Warmaul Breastplate 17
Breastplate of Retribution 15
Blackened Chestplate 15
Boulderfist Armor 14
Heavy Earthforged Breastplate 14
Shattered Hand Breastplate 12
Talonguard Armor 12
Reaver Armor 11
The Exarch’s Protector 9
Lost Chestplate of the Reverent 8
Bloodscale Breastplate 8
Gilded Crimson Chestplate 7
Chestplate of A’dal 7
Jerkin of the Untamed Spirit 7
Khan’aish Breastplate 7
Redeemer’s Plate 7
Torn-heart Family Tunic 7
Protectorate Breastplate 6
Marshwalker Chestpiece 4
Bogslayer Breastplate 4
Demon-Forged Chestguard 4
Warden’s Hauberk 4
Shamblehide Chestguard 3
Garmaul Chestpiece 3
Elegant Dress 2
Crimson Mail Hauberk 2
Cenarion Thicket Jerkin 2
Ango’rosh Breastplate 2
Azure Silk Vest 1
Acherus Knight’s Tunic 1
Black Mageweave Vest 1
Bonechewer Berserker’s Vest 1
Corsair’s Overshirt 1
Darkcrest Breastplate 1
Demon-Forged Hauberk 1
Drakescale Breastplate 1
Breastplate of Many Graces 1
Chestguard of the Dark Stalker 1
Chestguard of the Stormspire 1
Chestguard of the Talon 1
Farshire Robe 1
Flimsy Chain Vest 1
Lovely Black Dress 1
Lovely Blue Dress 1
Simple Black Dress 1
Skom Chain Vest 1
Scale Brand Breastplate 1
Refuge Armor 1
Runecloth Robe 1
Nexus-Strider Breastplate 1
Spring Robes 1
Tuxedo Jacket 1
Twilight Cultist Robe 1
Warrior’s Embrace 1
Worgblood Berserker’s Hauberk 1
Wrathfin Armor 1

It’s worth charting this distribution too, since the shape of the distribution curve is important:

69 DK chest items - distribution.

69 DK item distribution - log scale.

The error in Gevlon’s argument stems from our common-sense understanding of average. Most often we think in terms of a Gaussian distribution – so often that it is actually called a normal distribution. When events or things are distributed normally, then the average outcome is, well, average. But, as various people have observed, when network effects come into the picture, the typical distribution is not Gaussian but power-law-like. The majority cluster around a very few choices, with a rapid fall-off into a long tail of more funky choices, where each choice is made by only a few individuals.

Now WTF does all that mean in plain English? Simple. If WoW gear, gems or enchant choices were normally distributed, a few people would make the best choice,  a few people would make the worst choice and most would make a so-so choice. We would expect to see a few 69 DKs with the Uber Breastplate of Pwnage, a few with the Scruffy Tunic of Suckage, but most would be wearing the Mediocre Breastplate of, um…, Mediocrity. And that’s what my report would find for you.

But you can see from the charts that the data looks nothing like a normal distribution. Most players have in fact made the same few choices – which generally represent a trade-off between how powerful the item is and how easy it is to get hold of.  Those people who don’t follow the crowd are out in the long tail – here the picture is murky because we don’t know whether they are there because of ignorance or whether they have hit on some effective but as-yet unknown (or difficult to obtain) solution to the problem. (And some are out there because the data is capturing multiple playstyles – no doubt those toons wearing tuxedo jackets and lovely blue dresses are being played by people who know exactly what they are doing.)
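You can get a rough feel for the shape without a chart. The Python sketch below takes the ten most popular items from the table above, fits a straight line to log(count) against log(rank), and works out how much of that top ten the first three items account for. A negative slope on a log-log plot is only suggestive of a power law, not proof of one, and the function name is my own.

```python
import math

# Head of the chest-slot table above: the ten most popular items.
counts = [2421, 1010, 996, 898, 655, 309, 208, 197, 141, 136]

def loglog_slope(counts):
    """Least-squares slope of log(count) against log(rank), rank starting at 1."""
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(c) for c in counts]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

# Share of these ten items taken by the three most popular ones.
top3_share = sum(counts[:3]) / sum(counts)
print(round(loglog_slope(counts), 2), round(top3_share, 2))
```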

But for our purposes, averages are just what we want – they show the consensus view across the player base on what are the reasonably sensible and effective choices.

To me, the interesting question is how this consensus forms. Undoubtedly Gevlon’s point has an element of truth – the average WoW player is no theorycrafter. But there are feedback mechanisms that shape their choices. They have the game itself. And they have instant access to the collective wisdom of the player-borg’s vast hive mind thanks to all the commentary and guides here in cyberspace. It is these network effects that make the distribution take the shape that it has.

You may not have noticed yet, given all the 3.1 (and now 3.1.1!)  fun, but a whole slab of character stats have disappeared from the armoury. The most interesting ones that have gone are the detailed BG performance stats. Strangely enough, the raiding performance numbers are still there – the missing ones all seem to be PvP related.

Now if I were a, like, y’know, paranoid kinda person, I’d be putting forward the following conspiracy theory. PvP stats are about class performance and raid stats are about group performance. Raid stats tell you something about a guild because performance depends on the ability of the guild to coordinate, to lead, to control the Leeroy Jenkins element etc etc. And indeed there are websites out there that do exactly that:  rate guilds by how many, and which, raid dungeon bosses they’ve downed.

But PvP stats, being all about the mano a mano thing, tell you something about the relative performance of classes. And that subject really does seem to be Blizz’s bête noire these days. Just exactly why they’re so focused on it escapes me – but then I never go anywhere near the official fora so maybe that’s why I’m in the dark.

My conspiracy theory would be that they don’t want anybody to be able to just run class balance through the ol’ spreadsheet to see what comes out the other end.

I made a modest contribution to using the BG data to look at class balance here on this blog. But I believe I was pretty careful to say, like dude, we don’t expect the classes to be balanced across any narrow set of performance measures. Classes that can tank and CC and heal are people too.

But anyway I’m not the paranoid type, so that’s enough of that. It’s just a silly game. Let’s move on.

Just to complete the picture, here is a set of battleground class performance charts for each of the x9 BG levels plus level 80. The y-axis now shows average deaths per game and not the inverse, so the sweet spot of high-kills-low-deaths is in the bottom right hand corner of the chart.

The sample consists of all players at each level who have played 100 or more BGs. The data is from patch 3.0.9.
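For the record, the calculation behind each point on the charts is along these lines. This is a Python sketch rather than the SQL I actually run; the records are made up, and averaging each player’s per-game rate (rather than pooling all games for the class) is an assumption of the sketch.

```python
from collections import defaultdict

# Made-up records: (class, games played, total kills, total deaths).
players = [
    ("Hunter", 250, 2000, 750),
    ("Hunter", 150, 900, 600),
    ("Warrior", 120, 600, 720),
    ("Warrior", 40, 400, 100),   # under 100 games, so excluded
]

def class_averages(players, min_games=100):
    """Per class: (mean kills per game, mean deaths per game) over qualifying players."""
    per_class = defaultdict(list)
    for cls, games, kills, deaths in players:
        if games >= min_games:
            per_class[cls].append((kills / games, deaths / games))
    return {
        cls: (
            sum(k for k, _ in rates) / len(rates),
            sum(d for _, d in rates) / len(rates),
        )
        for cls, rates in per_class.items()
    }

print(class_averages(players))
```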

There are a lot of interesting things to note in those charts, especially when you compare the same class at different levels. Some are effective at all levels, others appear to change roles as they level up.

If you want the executive summary, these are the points that strike me:

  • DKs are OP
  • Rogues aren’t, even though people think they are
  • Warriors are fragile, despite all that armour
  • Warlocks are still a force in PvP as long as you don’t mind dying a lot
  • Baby Paladins may be easy meat but the adult of the species sure isn’t
  • Hunters seem to be the consistent high performer, but that is probably because they just play the same role (of ranged attacker) at every level

Do exercise some caution when interpreting these results. In particular remember:

  1. Some classes have fewer attacking players and therefore a lower average kill rate just because they have a healing tree. Other classes may spend a lot of time CC-ing instead of attacking.
  2. Some BGs have objectives that conflict with straight PvP. For example in Warsong Gulch, the classes that spend most time running the flag will have a lower kill rate because of that.
  3. This is data aggregated across every BG accessible at the level. There may be specific features of individual BGs that make certain classes more effective there, despite these charts.



BG Class Effectiveness, Level 19

BG Class Effectiveness, Level 29

BG Class Effectiveness, Level 39

BG Class Effectiveness, Level 49

BG Class Effectiveness, Level 59

BG Class Effectiveness, Level 69

BG Class Effectiveness, Level 79

BG Class Effectiveness, Level 80

One annoying problem for armoury mining is that the armoury servers do not give the name of the enchant attached to an enchanted item. What you get are the bonuses granted by the enchant, along with a magic number key that references some internal Blizz data source.

The trick is to find the enchant that goes with the bonus, since the thingy that grants the bonus is what matters to players.

So we get bonus strings like +10 Defense Rating/+10 Stamina/+15 Block Value. But that’s not a lot of use unless I can tell you how to get these bonuses. What you need to know is that Presence of Might grants this. But nothing in the armoury gives us that link.
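Once the link is known, the lookup itself is trivial. Here is a minimal Python sketch of the sort of table we are trying to build, with the one entry from the example above. The function names are my own, and normalising the string (so that ordering and spacing of the parts don’t matter) is an assumption about how messy the armoury strings might be.

```python
def normalise(bonus_string):
    """Split on '/', trim, lower-case and sort so ordering and spacing don't matter."""
    return tuple(sorted(part.strip().lower() for part in bonus_string.split("/")))

# One entry taken from the text; a real table would need many more.
ENCHANTS = {
    normalise("+10 Defense Rating/+10 Stamina/+15 Block Value"): "Presence of Might",
}

def enchant_for(bonus_string):
    """Return the enchant name for a bonus string, or None if it is unknown."""
    return ENCHANTS.get(normalise(bonus_string))

print(enchant_for("+10 Stamina / +15 Block Value / +10 Defense Rating"))
```

The hard part, of course, is filling in the table.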

I must admit that I’d basically filed this one away in the too-hard basket – that’s why there is no enchant data on this site. But now my plan to do twink analysis means that I can no longer ignore the problem – it’d be a pretty tenth-rate twink that wasn’t enchanted up to the gills.

Fortunately, over at Armory Musings, Okoloth came up with an ingenious solution to this problem. You can read all about it here, but basically what he’s doing is searching various sites in the WoW datasphere for the bonus strings, and finding the matching enchant or item. Brilliant!

Okoloth produced an XML file that maps the enchants-to-bonuses he was able to discover. You can get that XML file from here. But being a do-it-yourself kinda bloke, I thought I’d have a go at the problem myself and see what I could do.

Now, when you mention search, one word pops into my mind straight away – starts with ‘G’… Fortunately, Google offers a way of using their data from software. It is possible to write a program that will get Google to search for a phrase, and return the search results in a structured form that in turn can be processed locally. In particular, you get the URL of the page, which is what we want for the link, and the title of the page – which should, with a bit of luck, be the name of the enchant spell or item. The rest of the page indexed by Google can be ignored – those are the two pieces of data we’re after.

A bit of string-searching on the returned results for a well-known WoW database site like Wowhead or Thottbot is all that is required. The correct search results can be identified and written out to XML or whatever.
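The string-searching step looks something like this in Python (my own code is C#, so treat this as a sketch of the idea rather than what I run). It assumes the search results have already been fetched and boiled down to (url, title) pairs; the sample results and their URLs are invented placeholders, and no actual search API call is shown.

```python
from urllib.parse import urlparse

KNOWN_SITES = ("wowhead.com", "thottbot.com")

def pick_database_hit(results):
    """Return the first (url, title) whose host is a known WoW database site, else None."""
    for url, title in results:
        host = urlparse(url).netloc.lower()
        if any(host == site or host.endswith("." + site) for site in KNOWN_SITES):
            return url, title
    return None

# Invented results standing in for whatever the search API returns.
results = [
    ("http://www.example-tackle-shop.com/bait", "+2 Fishing lures on sale"),
    ("http://www.wowhead.com/placeholder", "Some Enchant - World of Warcraft"),
]

print(pick_database_hit(results))
```

Matching on the host rather than on the whole URL avoids being fooled by a page that merely mentions one of the sites somewhere in its address.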

Google has a nice REST API, intended for embedding in webpages, but easily callable from your favourite programming or scripting language. There is even an open-source C# wrapper for those of us trapped in Billg-land. If you want to repeat this exercise from an MS environment, I highly recommend this, which hides all the googly grunge.

It didn’t take long to write the code and I was ready to boldly go where only one armoury datamining site had gone before. And it worked pretty well too; with a being as omniscient as Google on my side, how could I possibly lose?

Not every enchant bonus string is accurately found. Some are too vague. +2 Fishing gets a lot of hits on bait-and-tackle shops, for example. A few end up finding no matches for reasons I can’t quite fathom. But the vast majority of bonuses find a matching enchant spell or item, and the rest can be fixed up by hand.

Those ones that my software couldn’t match can generally be found by a manual search because they are buried away in tables such as those at WoWWiki.

The other problem is that a lot of enchants have links to both spells and items. Generally what we want is an item, if one exists, since that is what you have to obtain ingame to start the process. But that is not a fatal problem, since the spell-to-item cross-links can be found in the WoW database sites themselves. All I need to give you  is one link to Wowhead. When you get there, you’ll find the spells and items linked together and you can sort it all out quickly.

Okoloth made his XML file freely available so it would be remiss of me not to do the same. For some mad reason I can’t post XML here, so I’ve renamed the extension to .doc. But if you look inside you’ll see it is well-formed XML; just remove the .doc extension and you’re good to go. You can download it here: {link temporarily removed by me – see update}.

Thanks again to Okoloth for solving this irksome little problem!

UPDATE: Not quite there yet – I’ve removed the xml file download as the quality isn’t quite up to scratch. I’m still confident that this method will work but there are a couple of issues to work through before I release it.