## the problem of assessing statistical methods

**A** new arXival today by Abigail Arnold and Jason Loeppky discusses how simulation studies are, and should be, conducted when assessing statistical methods.

“Obviously there is no one model that will universally outperform the rest. Recognizing the “No Free Lunch” theorem, the logical question to ask is whether one model will perform best over a given class of problems. Again, we feel that the answer to this question is of course no. But we do feel that there are certain methods that will have a better chance than other methods.”

I find the assumptions or prerequisites of the paper arguable [in the sense of **2**. *open to disagreement; not obviously correct*]—not even mentioning the switch from models to methods in the above—in that I will not be convinced that one method outperforms another simply by looking at a series of simulation experiments. (Which is why I find *some* machine learning papers unconvincing, when they introduce a new methodology and run it through a couple of benchmarks.) This also reminds me of Samaniego’s *Comparison of the Bayesian and frequentist approaches*, which requires a secondary prior to run the comparison. (And hence is inconclusive.)

“The papers above typically show the results as a series of side-by-side boxplots (…) for each method, with one plot for each test function and sample size. Conclusions are then drawn from looking at a handful of boxplots which often look very cluttered and usually do not provide clear evidence as to the best method(s). Alternatively, the results will be summarized in a table of average performance (…) These tables are usually overwhelming to look at and interpretations are incredibly inefficient.”

Agreed, boxplots are terrible (my friend Jean-Michel is forever arguing against them!). Tables are worse. But why don’t we question RMSE as well? It is most often a very reductive way of comparing methods. I also agree that the design of simulation studies is almost always overlooked and induces a false sense of precision, while failing to cover a wide enough range of cases. However, and once more, I question the prerequisites for comparing methods through simulations for the purpose of ranking those methods. (Which is not the perspective adopted by James and Nicolas when criticising the use of the Pima Indian dataset.)
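To make the point about RMSE concrete, here is a contrived numpy sketch (my own invention, not from the paper): two hypothetical methods whose RMSEs coincide exactly, although one is uniformly mediocre and the other fails catastrophically on a single replication.

```python
import numpy as np

# Invented per-replication errors for two hypothetical methods, built so
# that their RMSEs coincide while their error distributions differ sharply.
err_a = np.ones(100)          # always off by exactly 1
err_b = np.zeros(100)
err_b[0] = 10.0               # perfect 99% of the time, once off by 10

def rmse(e):
    return np.sqrt(np.mean(e ** 2))

print(rmse(err_a), rmse(err_b))  # both equal 1.0: RMSE cannot tell them apart
```

Whether one prefers the first method or the second depends entirely on the loss one cares about, which the single summary number erases.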

“The ECDF allows for quick assessments of methods over a large array of problems to get an overall view while of course not precluding comparisons on individual functions (…) We hope that readers of this paper agree with our opinions and strongly encourage everyone to rely on the ECDF, at least as a starting point, to display relevant statistical information from simulations.”

Drawing a comparison with the benchmarking of optimisation methods, the authors suggest ranking statistical methods via the empirical cdf of their performances or accuracies *across* (benchmark) problems, arguing that “significant benefit is gained by [this] collapsing”. I am quite sceptical [as often] of the argument, first because using an (e)cdf means the comparison is unidimensional, second because I see no reason why two cdfs should be easily comparable, third because the collapsing over several problems only operates when the errors for those different problems do not overlap.
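For concreteness, here is a minimal sketch of the collapsing being proposed, on invented performance figures (the method names and gamma-distributed accuracies are made up for illustration): however many benchmark problems go in, each method comes out as a single monotone ECDF curve.

```python
import numpy as np

rng = np.random.default_rng(0)
# Invented accuracies (say, RMSEs) of three methods over 50 benchmark problems
perf = {"A": rng.gamma(2.0, 1.0, 50),
        "B": rng.gamma(2.0, 1.2, 50),
        "C": rng.gamma(2.0, 0.9, 50)}

def ecdf(x):
    """Sorted values and empirical cdf levels, ready for a step plot."""
    xs = np.sort(x)
    return xs, np.arange(1, len(xs) + 1) / len(xs)

# The collapsing: 50 numbers per method become a single curve each
for name, errs in perf.items():
    xs, ps = ecdf(errs)
    print(name, round(float(xs[0]), 2), ps[-1])
```

The one-dimensionality of the resulting comparison is plain in the code: everything about a method is channelled through the single sorted vector `xs`.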

November 4, 2015 at 8:02 am

What’s the problem with boxplots?

November 4, 2015 at 8:29 am

Boxplots essentially reduce to three central numbers, plus two extremes, while giving a 2-d impression of spread. Very often, when plotting the histogram of the corresponding data, the resulting impression and comparison of populations is quite different.
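For instance (a contrived numpy construction on invented data): two samples sharing exactly the same five-number summary, hence the same boxplot, while one is flat and the other sharply bimodal.

```python
import numpy as np

# A uniform-looking sample and a sharply bimodal one, built so that their
# five-number summaries (min, quartiles, median, max) coincide exactly.
flat = np.linspace(0, 10, 101)
bimodal = np.sort(np.concatenate([
    [0.0],                         # same minimum
    np.linspace(2.4, 2.5, 25),     # left mode, ending at Q1 = 2.5
    np.linspace(2.5, 2.6, 24),
    [5.0],                         # lone point at the median
    np.linspace(7.4, 7.5, 24),
    np.linspace(7.5, 7.6, 25),     # right mode, starting at Q3 = 7.5
    [10.0],                        # same maximum
]))

for q in (0, 25, 50, 75, 100):
    print(q, np.percentile(flat, q), np.percentile(bimodal, q))
```

The boxplots are indistinguishable, yet a histogram of each sample immediately reveals the two modes that the boxplot erases.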

November 4, 2015 at 1:14 am

I would also say that there are basically no tight non-asymptotic theoretical convergence results in computational Bayesian statistics, so I can’t see a meaningful way to rank or compare methods outside of simulation studies. So I can’t quite understand your distaste for them.

There’s a chance of a Peskun ordering in “toy” problems (such as comparing an M-H method with its pseudo-marginal variant), or there is the recurrence time idea of Nichols, Fox and Watt, but I can’t really see that being a successful ranking strategy for even a problem as simple as the type of Poisson regression with log-mean modelled by covariates and correlated (CAR) region-based effects that you’d see in disease mapping applications.

Is there a method for comparing MCMC algorithms (or heaven forfend MCMC and non-sampling-based strategies) that doesn’t use simulation studies that I don’t know about? (If it helps, my preferred answer to this question is “yes” for obvious reasons!)

November 4, 2015 at 1:02 am

My somewhat specific experience with people comparing methods is that the original sin is that they don’t compare methods for solving the same problem in the same way. (Literally comparing timings for a method that does Bayesian inference against one that computes a point estimate is one of my favourite cases.)

Or the two methods solve slightly different problems (such as problems with different priors).

Or that the comparison is between one method that is designed to solve a specific problem, and one explicitly designed to solve a different problem.

I could go on (and, for that matter, provide references), but it seems pointless.

While I think working out “how to numerically compare general statistical methods” is a REALLY GOOD IDEA, it wouldn’t have fixed any of these.