#502 - June 18, 2012, 8:36 p.m.
I disagree. Simulation and formulation need to be debugged, yes, but if you don't like looking at math, what would constitute evidence? Math can be, has been, and will be improperly constructed, but without some logical/rational evidence, qualitative data means nothing. It's just hearsay.
We like looking at math. You wouldn't last very long in this gig if you didn't.
I know it seems like a Catch-22, but really I think it just illustrates that estimating DPS is not at all straightforward. If it were even moderately easy to do, everyone (including us) would know exactly what all of the targets were and how far everyone deviates from those targets.
There are two problems with simulations. One is that we just don't know if they are accurate. We could determine if they are accurate, but that takes a lot of our time (which is what I meant by debugging). In my experience, they tend to be as accurate as the dedication of the group of players working on them. What this tends to mean is that certain specs are modeled very well and others... less so. That changes over time as the commitment of key players waxes and wanes. Several systems designers got their start in the theorycrafting community. It is one we are familiar with.
The second problem with simulations is they assume perfect gameplay on a static boss. Now on the one hand it makes more sense to do that than it does to compare DPS on every boss. It's often hard, even for the community, to decide when a fight has a gimmick versus being a legit comparison fight. On live, Ultraxion is the closest thing to a one-target fight with no movement, but even then it sprays the warriors with rage and the length of any fight has relevance because of how many times everyone's cooldowns are available. On the other hand, players tend to care more about how they can actually perform on a fight, not how they could theoretically perform on a boss that doesn't exist (unless you are in Naxxramas perhaps). As an aside, it's fun to go back and compare some of the predicted simulations for various tiers with actual parses.
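To make the point about fight length concrete, here is a minimal sketch of how a cooldown's share of a fight shifts with duration. The function name and the numbers (a 20-second buff on a 180-second cooldown) are made up for illustration, and it assumes the buff is used on the pull, then again the moment it comes off cooldown, and that it is never longer than its own cooldown.

```python
def cooldown_uptime(fight_seconds, cooldown_seconds, buff_seconds):
    """Fraction of a fight spent under a buff used on pull and then on cooldown.

    Assumes buff_seconds <= cooldown_seconds so that uses never overlap.
    """
    active = 0
    start = 0
    while start < fight_seconds:
        # A use near the end of the fight only counts for the time remaining.
        active += min(buff_seconds, fight_seconds - start)
        start += cooldown_seconds
    return active / fight_seconds


# Hypothetical fight lengths on either side of a cooldown boundary.
for length in (170, 190, 360, 370):
    print(length, round(cooldown_uptime(length, 180, 20), 3))
```

A fight that ends just after a cooldown comes back up gives that spec a noticeably larger share of buffed time than one that ends just before, which is one reason two otherwise similar fights can rank specs differently.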
I mentioned already the problem with comparing parses. They are reasonably good for comparing, say, rogues to mages, but terrible at comparing the various specs of rogues. The reason is sampling bias: all of the good rogues (and many of the not-so-good ones) swap to whatever spec the highest DPS parses use. There are some problems with this. Some of the highest DPS rotations are challenging to execute. Just because the best rogue in the world hit those numbers doesn't mean you ever can. It also doesn't mean that your DPS will go up by a certain percent just because you used his spec. Some fights just work out really well for some specs, because of adds or movement or burn phases or fight duration. Are these gimmicks that should be tossed out? At some point, you're tossing out every fight. Now to be fair, I'm not saying that Subtlety is secretly the highest DPS spec on Ultraxion and nobody knows about it. However, the delta between Sub and Combat is probably smaller than most people think. The sample size for Sub is much smaller, and presumably its numbers are diluted by a lot of uninformed, less geared, or less skilled players, or by other people just messing around.
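The sampling bias can be sketched with a toy model. Everything here is an assumption for illustration: the function name, the DPS numbers, and the rule that more skilled players are more likely to pick the popular spec. The two specs are deliberately given the same damage formula, so any gap in the averages comes purely from who plays which spec.

```python
import random
from statistics import mean


def simulate_parses(n_players=10_000, skilled_popular_share=0.9, seed=0):
    """Toy model: both specs do identical damage at equal skill (an assumption)."""
    rng = random.Random(seed)
    parses = {"popular": [], "unpopular": []}
    for _ in range(n_players):
        skill = rng.random()  # 0 = just messing around, 1 = top player
        # Assumption: chance of copying the top-parse spec rises with skill,
        # from 50% at skill 0 to skilled_popular_share at skill 1.
        p_popular = 0.5 + (skilled_popular_share - 0.5) * skill
        spec = "popular" if rng.random() < p_popular else "unpopular"
        # Same formula for both specs: DPS depends only on skill plus noise.
        dps = 20_000 + 20_000 * skill + rng.gauss(0, 1_000)
        parses[spec].append(dps)
    return parses


parses = simulate_parses()
gap = mean(parses["popular"]) - mean(parses["unpopular"])
print(len(parses["popular"]), len(parses["unpopular"]), round(gap))
```

In this model the unpopular spec ends up with a smaller sample and a lower average even though neither spec is actually stronger, which is the kind of distortion that makes parse-based comparisons between specs of the same class unreliable.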
The best test I can come up with is to do a fight as one spec and then do the fight as another spec, trying to keep all other factors equal, and see how you do. That might be feasible in Dragon Soul today where you're likely looking for something to spice things up a bit. On the other hand, your guild leader might not want to hear on cutting-edge content "Hey, mind if I try an experiment?" :)
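If you do run that experiment, one rough way to read the results is to pair up attempts and look at the per-pair difference. This is only a sketch: `paired_gap` is a hypothetical helper, and it assumes each pair is the same player on the same fight with the same gear and a similar raid setup.

```python
from math import sqrt
from statistics import mean, stdev


def paired_gap(spec_a_dps, spec_b_dps):
    """Mean per-attempt difference (A minus B) and its standard error."""
    if len(spec_a_dps) != len(spec_b_dps) or len(spec_a_dps) < 2:
        raise ValueError("need matched attempts, at least two pairs")
    diffs = [a - b for a, b in zip(spec_a_dps, spec_b_dps)]
    return mean(diffs), stdev(diffs) / sqrt(len(diffs))
```

If the mean difference is small compared to its standard error, a handful of attempts is not enough to say which spec did better for you.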
As I said, it's hard. I have used the thermometer analogy before. If you want to know the temperature outside, you can use a thermometer (or more realistically, a phone app) and be pretty confident that the number you see is reasonable (72 degrees F / 22 C in SoCal at the moment, as it very often is). There isn't a thermometer for rogue DPS. There are a lot of stats and a lot of estimations that, when taken collectively, can give you some idea of where things lie.
Nevertheless, if you have any numbers from beta that suggest one spec is below the others, please share them with us. We are comfortable with our testing mechanics, but they have been wrong before, and reconciling them with the numbers from other people is never a bad idea. It's really not possible to have useless data, as long as you take everything relevant into consideration as part of the analysis.