druid cats and bears again

March 2, 2010

This is a guest post by Darush.

A while back Zardoz posted an estimate of feral druid builds (bears vs. cats). The original analysis was based on human expertise: bears usually choose certain talents while cats choose others. However, one can take another approach. There are inherent differences in the builds for each spec, and computational methods that know nothing about World of Warcraft might be able to identify these differences by examining the raw data. Armed with Matlab and some free time, I tried to explore the great kitty question.

A quick note about myself: I have been playing World of Warcraft since Wrath of the Lich King was released. I currently have two characters on Cairne, Darush (a hunter) and Azabroth (a priest). Outside of the game, I’m a PhD student in computational biology, and my thesis deals with identifying subpopulations in cell data. Interestingly, one can easily replace “cell data” with the druid data Zardoz mined from the armory.

For those who would just like the executive summary: Out of the data Zardoz sent me, 27% of the toons could be bears and 65% could be cats. Be aware that there are some ifs-and-buts associated with these numbers and they should be treated with due caution.

Now for the details:

I started with a very large table that included the agility, stamina, dodge rating, and talent points allocation for 28,970 toons who had a Feral build (either active or not). I initially wanted to assess how well agility and stamina predict whether a toon is a cat or a bear. Plotting stamina vs. agility, I see the following:

First of all, notice that annoying flat line at the bottom. These are toons with low agility, which means they’re not wearing feral gear in their feral spec. Since I don’t want to confuse my algorithm, I removed these. I ended up with 20,365 toons and the following plot:

Two separate populations emerge. There are the high stamina/low agility toons, which I’m guessing are mostly bears, and the high agility/low stamina toons, which are probably mostly cats. Unfortunately, there is a huge overlap, especially in the lower attributes range. It appears that stamina and agility will not be enough to decide who is what.

Now, for some magic! I used an algorithm called PCA (principal component analysis). In a nutshell, PCA attempts to score the variability in the data. The algorithm takes a long list of numbers (in this case, I used the talent builds for each toon) and outputs a series of scores for each toon. Each number in the series is a component; the first number for each toon is the first component, the second is the second component, etc. Intuitively, PCA compresses the data by discarding less variable information.

(Please note that this is a very hand-wavy explanation. Apologies to all mathematicians, physicists and computer scientists in the crowd.)
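For readers who want to poke at the idea, here is a minimal sketch of PCA in plain Python. It is not the Matlab code used for this post: the data are made up (four hypothetical talents, 50 "bear-like" and 50 "cat-like" builds), all function names are invented for illustration, and power iteration with deflation stands in for a proper library PCA routine.

```python
import math
import random


def center(rows):
    """Subtract each column's mean so every talent is centered at zero."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]


def covariance(centered):
    """Sample covariance matrix of already-centered rows."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def top_direction(cov, iters=500, seed=0):
    """Power iteration: approximate the unit direction of largest variance."""
    d = len(cov)
    rng = random.Random(seed)
    v = [rng.uniform(-1, 1) for _ in range(d)]  # random start vector
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


def pca_scores(rows, k=2):
    """Score each row on the first k principal directions."""
    x = center(rows)
    cov = covariance(x)
    d = len(cov)
    directions = []
    for _ in range(k):
        v = top_direction(cov)
        # Variance captured along v (the corresponding eigenvalue).
        lam = sum(v[i] * cov[i][j] * v[j] for i in range(d) for j in range(d))
        directions.append(v)
        # Deflate: remove the variance already explained by v.
        cov = [[cov[i][j] - lam * v[i] * v[j] for j in range(d)]
               for i in range(d)]
    return [[sum(r[j] * v[j] for j in range(d)) for v in directions] for r in x]


# Made-up talent tables: points in 4 hypothetical talents per toon.
rng = random.Random(42)
bears = [[5 - rng.randint(0, 1), 3 - rng.randint(0, 1),
          rng.randint(0, 1), rng.randint(0, 1)] for _ in range(50)]
cats = [[rng.randint(0, 1), rng.randint(0, 1),
         5 - rng.randint(0, 1), 3 - rng.randint(0, 1)] for _ in range(50)]

# One [component 1, component 2] pair per toon.
scores = pca_scores(bears + cats)
```

On this toy data the first component alone separates the two made-up groups, because the biggest source of variability is which pair of talents a toon invested in. Real armory data has many more talents and far messier builds.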

After running PCA using just the talent specs, I looked at the first and second components, and ended up with this plot:

We can clearly see two populations. I guessed that one was bears, the other cats. Usually at this step I would run another fancy algorithm to identify these automatically and score how different they are; such algorithms are called clustering algorithms, and each group is a cluster. For this analysis I chose the lazier path and settled on two arbitrary dividing lines:

The top green cluster has 5,571 toons. The bottom one has 13,226. There are 1,568 blue toons, which are undecided (7.7% of the data).
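The "two lines" split can be sketched in a few lines of Python. The line coefficients below are hypothetical, since the actual lines were drawn by eye on the plot, and the function name is invented for illustration. The cluster counts are the ones reported above.

```python
def assign_cluster(pc1, pc2, upper, lower):
    """Label a point by its position relative to two dividing lines.

    Each line is a (slope, intercept) pair in the component 1 / component 2 plane.
    """
    if pc2 > upper[0] * pc1 + upper[1]:
        return "c1"         # above the upper line: top cluster
    if pc2 < lower[0] * pc1 + lower[1]:
        return "c2"         # below the lower line: bottom cluster
    return "undecided"      # between the two lines


# Hypothetical lines; the real coefficients are not given in the post.
UPPER = (0.0, 1.0)
LOWER = (0.0, -1.0)

# Three made-up points, one for each possible label.
labels = [assign_cluster(x, y, UPPER, LOWER)
          for x, y in [(0.3, 2.5), (1.0, -3.0), (-0.5, 0.2)]]

# Share of each group, computed from the counts reported in the post.
counts = {"c1": 5571, "c2": 13226, "undecided": 1568}
total = sum(counts.values())
shares = {name: round(100 * n / total, 1) for name, n in counts.items()}
```

Leaving a band between the lines unlabeled trades coverage for confidence: toons near the boundary are simply not called either way.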

It’s time for a guessing game. I am guessing one of these groups is bears and the other is cats. It is quite possible that I am completely mistaken, but since this is one of the major differences between feral druids (and the initial purpose of my analysis), I decided to give it a shot. If I am correct, then which one of these is the bear cluster, and which one is the cat cluster? I decided to do some more speculation. Let’s examine the stamina vs. agility plot again:

It appears that toons around black line 1 are mostly bears and toons around black line 2 are mostly cats. As mentioned previously, there is a huge overlap. Fortunately, the overlap ends at some point. We can guess that points to the right of red line 1 are “definitely bears”, since they have very high stamina, while points above red line 2 are “definitely cats”, since they have very high agility.
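As a rough sketch, the two red lines amount to a pair of cutoffs on stamina and agility. The cutoff values and names below are hypothetical, since the post does not give the actual positions of the lines. Leaving toons that clear both cutoffs unlabeled is also an assumption made here, not something the post specifies.

```python
STAMINA_CUTOFF = 3000   # hypothetical position of red line 1
AGILITY_CUTOFF = 1500   # hypothetical position of red line 2


def gear_label(stamina, agility):
    """Label a toon from its stats alone, or leave it unlabeled."""
    high_stamina = stamina > STAMINA_CUTOFF
    high_agility = agility > AGILITY_CUTOFF
    if high_stamina and not high_agility:
        return "definitely bear"
    if high_agility and not high_stamina:
        return "definitely cat"
    return "unlabeled"   # overlap region, or both stats very high


# Made-up stat pairs, for illustration only.
examples = [gear_label(3500, 900), gear_label(2000, 1800), gear_label(2200, 1000)]
```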

What shall we do next? On one hand, from the PCA analysis (only talent spec) we have the top green cluster (which I called c1) and the bottom green cluster (c2). From the stamina vs. agility plot, we have the “definitely bear” and the “definitely cat” toons. We can now ask the following four questions:

  1. Does c1 have many “definitely bears”?
  2. Does c1 have many “definitely cats”?
  3. Does c2 have many “definitely bears”?
  4. Does c2 have many “definitely cats”?

We answer these using a statistical test called a hypergeometric test, also called a one-tailed Fisher’s exact test. The answers are yes, no, no, yes. Therefore, we can say that c1 is highly enriched for bears while c2 is highly enriched for cats.

(For the statistically inclined, all four p-values are lower than 10^-20.)
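For the curious, the enrichment question can be sketched with the standard library alone. The function name is invented for illustration and the numbers in the example are made up, because the post does not report how many "definitely bear" or "definitely cat" toons fell into each cluster.

```python
from math import comb


def hypergeom_upper_tail(k, N, K, n):
    """P(X >= k) when drawing n toons without replacement from N toons,
    K of which carry the label of interest (e.g. "definitely bear")."""
    total = comb(N, n)
    highest = min(K, n)
    favorable = sum(comb(K, i) * comb(N - K, n - i)
                    for i in range(max(k, 0), highest + 1))
    return favorable / total


# Made-up example: 20 toons, 8 labeled "definitely bear", and a cluster
# of 10 toons that contains 7 of those 8.
p = hypergeom_upper_tail(7, N=20, K=8, n=10)   # about 0.0099
```

A small p-value means the cluster holds more labeled toons than a random draw of the same size would be expected to, which is what "enriched" means here. With tens of thousands of toons, even a moderate imbalance produces extremely small p-values.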

It is time for some truth in advertising. If I presented this analysis to my thesis adviser, she would probably hang me, rez me, hang me again, and then /gkick me out of my PhD program. There is much unsubstantiated guesswork involved, a mix of rules of thumb and hunches. The good news is that Zardoz manually examined some of the toons in c1 and c2, and did not identify any contradictions with my analysis.

And, of course, both of us will be glad to hear any remarks and comments.


10 Responses to “druid cats and bears again”

  1. Eq Says:

    Cool work

    Something to take into consideration: I have a feral druid with both a tank and a DPS spec. But since my gear is totally OP for heroics, I usually swap some tanking gear for DPS gear; this also helps with keeping aggro now that DPSers are doing insane damage as well.

    You might wanna check the avg ilvl of the gear and if these are mainly in the uncalled regions.

    Another approach is to look only at gems. I don't think any cat will put a full stamina gem in their set, just as a tank won't put an AP gem in their gear.

    If they have both, it's clearly a hybrid, or they have both specs. Also keep an eye out for people wearing PvP gear.

  2. tankadin Says:

    One of the problems that I see is that higher level gear tends to have more stamina on it. So it’s very possible that a cat with high level ICC gear will have more stamina than a new level 80 bear in level 200 Naxx gear. This will confuse your results.

    I’m wondering if there’s any way to “normalize” gear to take this into account.

    Another way might be to examine the characters in more narrow bands where you separate them into groups based on a tight range of average gear ilevel.

    Another option might be to put a floor on your data and exclude anyone with an average item level less than, say, 232-ish. It might make your results a little clearer.

    In a semi-related note, I’ve started playing around with protret on my paladin. I use my normal protection tank build, but I wear mostly retribution gear. It’s popular in PvP and I’ve been using it when I want to run heroics as DPS. The classic “signature” for protret is a protection build and lots of strength gems.

  3. Nelson Says:

    Great work! Does PCA tell you anything about the meaning of Component 1 and Component 2? I.e., is there some specific talent that everyone in Component 1 has but no one in Component 2, or some general trend in talents? Or is it muddier than that?

    One explanation for the fuzzy data is some feral druids are both cats and bears. Back when I was playing it was pretty common for me to be wearing a cat suit in a bear spec, or wearing bear gear in a hybrid cat/bear spec. Neat to see the statistical tools can pick a pattern out.

  4. […] at the Armory Data Mining blog, a plucky computational biology PhD student under the name of Darush has taken a look at some World […]

  5. Mania Says:

    Fascinating! Thank you for taking the time to do this analysis, Darush.

  6. Chris Lewis Says:

    @Nelson: PCA is much muddier than that; without getting too technical, it munges the different features together to try and come up with components that express the variability of the data. The original features are lost, so there is no meaning in the labels of Component 1 or 2, just that the data distribution indicates that there is some clustering.

  7. […] druid cats and bears again « Armory Data Mining […]

  8. […] “It is time for some truth in advertising. If I will present my thesis adviser with this analysis, she will probably hang me, rez me, hang me again, and then /gkick me out of my PhD program” – Armory Data Mining. […]

  9. paul Says:

    Hey, this is fun – what a great way to practise all the skills you need for your PhD, while exploring the details of something you love, i.e. WOW!

    I wish my students were this creative at dinging datasets.

  10. V Says:

    Hey zardoz,

    I would love to get a copy of this dataset. How can I get it?


